Security situation awareness algorithm of network information transmission based on big data

Wang, Xianping; Zhou, Zhiyuan

doi:10.1038/s41598-025-24204-3

Download PDF

Article
Open access
Published: 07 November 2025

Security situation awareness algorithm of network information transmission based on big data

Xianping Wang¹ &
Zhiyuan Zhou²

Scientific Reports volume 15, Article number: 39058 (2025) Cite this article

1920 Accesses
Metrics details

Subjects

Abstract

Malicious network attacks are becoming complex and more diverse, and this poses a threat to the effectiveness of the traditional security defense mechanisms. To resolve this, this paper presents a new three-stage algorithm of network security situation awareness that can be used to effectively identify threats in network information transmission using machine learning and big data processing. The first step in our methodology is to create a rich feature vector out of the network traffic flows, including statistical features in both directions, temporal, flow-based, and relational features. The second step applies a hybrid feature selection approach to achieve a greater efficiency and accuracy of the model; it applies a Distributed K-Means (D-KMeans) algorithm to cluster the features and a Mutual Information (MI) analysis to select the most informative, non-redundant set. The last operation is the use of a Distributed K-Nearest Neighbor (D-KNN) model to perform robust and scalable network traffic classification. The algorithm proposed was strictly tested on the CICIDS2017 dataset. The experimental results are improved with 98.91 accuracy, 93.71 precision, 98.95 recall, and 96.00 F-Measure. It is a statistically significant increase of at least 1.2 percent in accuracy (p < 0.05) over other available state-of-the-art approaches. The findings confirm that our solution is an extremely efficient and effective means of identifying security threats within large-scale network environments

Leveraging stacking machine learning models and optimization for improved cyberattack detection

Article Open access 14 May 2025

Optimized extreme learning machines with deep learning for high-performance network traffic classification

Article Open access 26 September 2025

An entropy and machine learning based approach for DDoS attacks detection in software defined networks

Article Open access 06 August 2024

Introduction

With the advancement of network technology and the widespread usage of the Internet, network security has gained significant attention. The current methods used to compromise network security are varied, intricate, and advanced. The complexity and unpredictability of network security are increasing. In this situation, it is imperative to swiftly and precisely comprehend the overall state of network security in order to effectively avoid and resolve these attacks. The current emphasis in research fields is specifically on network security¹. As systems go towards more sharing, complexity, and size, so do network intrusions; they are very intricate and can happen in many situations. Traditionally, network security depends on technologies such as vulnerability scanning and intrusion detection. Nevertheless, these ways have proven unable to solve the ever-changing requirements related to network security². Network security situational awareness technology originated in this particular scenario. Network Security Situation Awareness (NSSA) is a developing field that focuses on the acquisition, comprehension, visualization, and prediction of network entities in order to construct network security systems³. Effective NSSA not only identifies the presence of intrusions, but also tries to provide the whole picture of the security status of the network. When used in the context of this paper, increasing security awareness implies giving an extremely precise and detailed categorization of network threats. This rich output can be directly piped into Security Information and Event Management (SIEM) systems to create focused, actionable alerts, create accurate risk scores, and populate dynamic security dashboards, thus giving security analysts a more accurate and timely view of overall security posture.

Studying the NSSA for multi-step attacks is crucial in enhancing the detection and understanding of such attacks⁴. Because in recent years, network attacks have gradually shown large-scale, coordinated, and multi-stage characteristics. Network attacks are no longer isolated incidents, and multi-step attacks are emerging one after another. For example, the increasingly widespread Zeus botnet and worm attacks are highly concealed, penetrating, and targeted multi-step attacks⁵.

Following a thorough comparison between traditional technology and big data, experts have suggested directing attention towards users and identifying potential risks through the collection of extensive information. According to this approach, it is imperative and advantageous to develop a model for analyzing network security vulnerabilities and implementing specific actions to mitigate them. This involves gathering diverse relevant data and applying statistical principles. Several researchers argue that Internet technology has exerted a significant influence on humanity. The utilization of internet technology can enhance interpersonal communication^6,7.

Our discussion pertains to the vast amount of data encountered, particularly in modern networks such as 5G networks, and potentially in future generations. Dealing with this massive data volume for the identification and detection of security threats necessitates efficient techniques for data parsing and analysis. On one hand, employing a centralized model for network attack detection may not be a viable approach, as any flaw in this monitoring model could lead to system-wide issues, hindering security threat detection operations. Therefore, in this research, we endeavored to address these challenges by utilizing big data processing techniques and distributed computing methods.

The first contribution of the paper is a new, holistic framework that overcomes these challenges by a synergistic integration of three major steps. To begin with, we use a hybrid feature engineering paradigm that combines temporal, flow-based, and relational features to construct a rich and comprehensive characterization of network traffic. Second, we present a new two-step distributed feature selection algorithm where the Distributed K-Means (D-KMeans) algorithm is applied to group correlated features and then an iterative procedure based on the Mutual Information (MI) is applied to select the most informative and least redundant features within the groups. Lastly, the large-scale data are processed using a scalable Distributed K-Nearest Neighbor (D-KNN) classifier with efficient operation and high detection accuracy. The combination of feature generation with distributed classification of this pipeline is an original and effective method of network security analysis. The proposed model can also be implemented with the help of this architecture in two stages: offline and online. The offline phase permits the full implementation of the trained model, whereas in the online phase, the trained model is applied to detect the threats. The following is what this paper will contribute:

The development and application of a new two-stage feature selection algorithm that integrates the D-KMeans clustering algorithm with the iterative MI analysis in order to develop an optimal feature set that is as relevant as possible and as free of redundancy as possible.
The deployment and evaluation of an end-to-end security framework that is scalable and possesses the unique combination of hybrid feature engineering, our novel feature selection method, and a D-KNN classifier to enable high-accuracy threat detection on big data systems.
The thorough experimental verification of the CICIDS2017 and UNSW-NB15 datasets proves that the proposed framework outperforms existing ones, therefore, validating its effectiveness.

This paragraph continues the paper. The second part of our analysis focused on related studies. The third section described the materials and techniques. The fourth portion evaluates the findings, and the fifth section provides a conclusion.

Related works

In this part of the paper, we have discussed the work that has been done in this area over the past few years. Chen et al.³ have presented a new concept of network security situation prediction modeling based on a Long Short-Term Memory (LSTM) neural network and the stochastic gradient descent method. Simulation results indicated that this model is more accurate and intuitive.

Dong et al.⁸ proposed a model called the Sigmoid weighted reinforcement mechanism to enhance the accuracy in forecasting a situation of security in a network⁸. They applied cuckoo search for mining temporal correlations for the purpose of dealing with gradient difficulties and optimizing super parameters, which resulted it have faster convergence speed with reduced mistakes in consequence. Li et al.⁹ presented a network security prediction framework that utilized LSTM, Cross-Entropy function, Rectified Linear Unit (ReLU), and layer stacking. The framework was evaluated on KDD CUP 99 and UNSW-NB15 datasets to assess practical improvements. Diao¹⁰ utilized an enhanced LSTM neural network to analyze and process Network Security Situation Forecast (NSSF) data, enhancing convergence speed and reducing prediction error.

Fan et al.¹¹ proposed a situation awareness approach of ResMLP (Multi-Layer Perceptron) and LSTM networks for cybersecurity oriented to analysis on cyber-attack behavior, reducing computational complexity, and improving the efficiency of detection. Experimental results reveal the accuracy of this method. Lin et al.¹² explored network security situational awareness using deep learning technology, highlighting the need for active network security management, despite the model’s limited accuracy in regional security situations. Zhang and Liu¹³ have worked out situational assessment and security posture prediction in network security issues, having solved complicated problems with heterogeneous data structures and unknown threat characteristics. Thereby, the improved model was optimal for network posture value and rank assessment, with the fit being over 85%.

Chen et al.¹⁴ proposed a network security situational awareness framework using the Boltzmann machine, integrating technologies and automation tools to answer security analysts’ questions and build attack scenarios. Jiang and Wan¹⁵ introduced a network learning algorithm-based method for computer network information security risk assessment, improving accuracy by calculating risk, extracting performance characteristics, identifying risk factors, drawing conclusions, and optimizing assessment mode. Yun et al.¹⁶ introduced a machine learning method for analyzing network communication data, improving security, reducing interference, and achieving a 9 to 5% security level, surpassing online monitoring methods.

Jose and Deepa¹⁷ examined the effectiveness of deep learning algorithms in intrusion detection using the CIC-IDS 2017 dataset, emphasizing the necessity of artificial intelligence in devices with resource constraints, and compared deep neural networks, convolutional neural networks, and long short-term memory. Henry et al.¹⁸ proposed a Convolutional Neural Network and Gated Recurrent Unit (CNN-GRU) technique for network parameter optimization, achieving significant improvements in networking attack detection to approximately 98.73% at a False Positive Rate (FPR) rate of 0.075 in simulations carried out on the CICIDS-2017 benchmark dataset. Ho et al.¹⁹ proposed an intrusion detection system using a convolutional neural network for network traffic classification to classify the traffic flow into either benign or malicious for enhancing internet security.

Figueiredo et al.²⁰ presented a stacked-LSTM deep learning model for anomaly-based Network IDS and achieved an accuracy of 99% on the dataset, CICIDS2017; they were able to detect some forms of synthetic network attacks. Luo²¹ created a network security evaluation model using machine learning algorithms, enhancing classification accuracy with an improved Artificial Fish-Swarm Algorithm (AFSA) and Twin support vector machines (TWSVM) model, achieving 90% accuracy. Wang et al.²² introduced a network topology sensing method that utilized trusted agents and Group Decision Making (GDM) policies, enhancing coverage rates and worker agent selection, potentially aiding security analysts in detecting complex network environments. Wang et al.²³ proposed a novel prediction approach via the support vector machine method for assessing network security, where the indicators are analyzed and quantified for their performance against threats from Internet of Things (IoT) and big data.

Some of the studies are aimed at developing specific models that suit certain network environments. As an example, Alzubi²⁴ suggested an Industrial Wireless Sensor Networks (IWSN) model that integrates the Frechet Hyperbolic Deep Traffic (FHDT) approach to feature extraction and Statistical Dirichlet intrusion detection. On the same note, Yadawad et al.²⁵ presented an attack detection framework in IoT settings with a Colour Harmony Algorithm to select features and an Auto-metric Graph Neural Network (AGNN) to classify. Though these other works also mix feature selection/extraction with a classifier, our approach is different. We use a unique combination of statistical, temporal, and flow-based features and introduce a hybrid D-KMeans and MI-based feature selection method that is specifically designed to reduce redundancy and be efficient in large-scale, general-purpose networks. In addition, we use a D-KNN classifier that is unique to the big data problems that are not explicitly addressed by AGNN or Dirichlet-based models.

In contrast to our supervised method, other recent works have explored the potential of unsupervised learning, which is applicable in situations where labeled data are scarce. Kabilan et al.²⁶ suggested an unsupervised Intrusion Detection System (IDS) of in-vehicle networks with autoencoders and Fuzzy C-Means (FCM) clustering. In this direction, Fernando et al.²⁷ compared the performance of a number of unsupervised algorithms, including the K-means + + and Isolation Forest, in intrusion detection tasks in unbalanced data settings. These articles show that unsupervised techniques are possible in specific contexts (e.g., in-vehicle networks) or data (e.g., imbalance). Our work extends this work by demonstrating that a robust, supervised framework can be used, with a large, labeled dataset (CICIDS2017), to achieve high performance by developing a highly accurate and granular detection model. Table 1 lists the specifications of the studied works.

Table 1 Summary of the studied papers.

Full size table

The above comparative analysis shows that there are a number of prevailing trends and research gaps in the area of network security situation awareness. Much research has been done on new classification models, especially deep learning models such as LSTM and CNNs^3,11,18,20, to obtain high accuracy. However, this focus has the disadvantage of ignoring two key aspects of real-world deployment: explicit feature selection and big-data scalability. Most methods are based on the implicit feature extraction of deep learning models, which is computationally costly and is not interpretable, or do not focus on the problem of distributed processing of large data volumes.

Our proposed framework directly addresses these gaps. First, unlike the methods where feature selection is regarded as a secondary problem, our new hybrid methodology that uses Distributed K-Means (D-KMeans) clustering and Mutual Information (MI) provides an intuitive and efficient mechanism of selecting the most informative, yet not redundant features. This is necessary in reducing the computation overhead and improving the performance of a model in a big data situation. Second, other studies based on large datasets such as CICIDS2017^18,20 do not offer an end-to-end distributed architecture. It is basically a scalable model, which employs a distributed KNN (D-KNN) classifier and D-KMeans algorithm. In that regard, the primary innovation of our work is the fact that our work is holistic and pragmatic in nature, which synergistically balances high accuracy with the computational efficiency and scalability required by modern Security Operations Centers (SOCs), filling a gap between the performance of theoretical models and the realistic operational applicability.

Research methodology

In this section, after describing the characteristics of the dataset used in the research, the steps of the proposed method for secure information transfer in computer networks using big data processing techniques are explained.

Database

The present study, based on the CICIDS2017 dataset²⁸, has been designed and implemented. Unlike most common datasets for network security purposes, this dataset describes instances in the form of network traffic flows. The inclusion of unprocessed traffic flows in this dataset has led to a highly realistic simulation of network performance in the real world. Thus, by designing and evaluating the proposed method based on the data provided in this dataset, we can ensure its performance accuracy in real-world scenarios.

In addition to simulating real-world conditions, the strengths of this dataset lie in its diverse coverage of communication protocols and the inclusion of a wide range of new security threats and attacks in computer networks, making it a suitable choice for the current research. The experimental network includes a wide range of devices such as routers, switches, firewalls, and modems, and user/attacker machines running operating systems such as Mac OS X, Ubuntu, and Windows. In this dataset, traffic flows are stored in PCAP format, and the activities of 25 users are covered based on protocols such as HTTP, HTTPS, FTP, SSH, and E-Mail. The traffic flows in this dataset were recorded over a 5-day period (from July 3, 2017, to July 7, 2017). The recorded flows on the first day include only normal (non-attack) traffic. The total number of normal instances in this dataset is 2,273,097. In the subsequent days, with the activation of a network of attackers consisting of 4 systems, various types of attack-related traffic flows were recorded. These attacks can be categorized as follows: Scan (158,930 instances, including Firewall Rule on and Firewall rules off), Bot (1,966 instances, including various Botnet ARES types), Heart-bleed (11 instances, including Heartbleed Port 444), Infiltrations (36 instances, including Dropbox download and Cool disk), DDoS (128,027 instances), and DoS (252,661 instances, including slowloris, Slowhttptest, Hulk, GoldenEye, and LOIT), Brute-Force (13,835 instances, including FTP-Patator and SSH-Patator), and Web-Based (2180 instances, including Web Attack – Brute Force, XSS, and SQL Injection). Further details of this dataset are elaborated in²⁸.

Proposed methodology

Given the substantial volume of data described in the previous section, processing this data to form a robust detection model necessitates the utilization of big data processing techniques, which have been investigated in this study. This section focuses on explaining the process of analyzing network traffic flows for secure data transmission using machine learning techniques and big data processing. The proposed method includes the following key steps:

1.
Describing the features of traffic flows.
2.
Feature selection based on clustering and MI analysis.
3.
Distributed machine learning-based classification of traffic flows.

The computational stages of the proposed method are presented in Fig. 1.

As depicted in Fig. 1, the proposed framework comprises a combination of labeled traffic flow data (training set) and unlabeled data (test set), along with modules for feature construction, feature selection, and classification. In the first stage, each database instance is described by a feature vector composed of statistical features extracted from its traffic flow. Following the description of network traffic flow features through statistical and relational features, a feature selection strategy based on clustering is employed to reduce the dimensionality of the data. The goal of this step is to eliminate features unrelated to security threats in network flows, thereby enhancing detection accuracy while reducing hardware requirements for processing big data. To achieve this, in the feature selection step, the indicators extracted from traffic flows are first clustered using the Distributed KMeans algorithm (D-Kmeans). After this process, the data features are organized into clusters.

Next, MI is utilized to rank the indicators based on their relationship with the target variable. Subsequently, an iterative approach is employed to determine the set of relevant indicators based on the MI values for each category. The outcome of this process is a subset of descriptive traffic flow indicators that provide the highest level of useful information regarding security threats in the flow. In the third step of the proposed method, the selected indicators are used for classifying traffic flows. For this purpose, a D-KNN model is employed. This model can achieve high detection accuracy while meeting computational requirements for processing big data. After training the D-KNN model based on the training instances, it will be used to detect security threats in new traffic flows.

As shown in Fig. 1, the proposed architecture is executable in two phases: offline and online. The offline phase involves processes such as identifying the relevant indicator list and constructing the distributed learning model based on the training data. In the online (testing) phase, the trained model observes a set of unlabeled traffic flows, and after extracting relevant indicators from it, these indicator sets are analyzed by the trained model to identify security threats in the traffic flows.

In order to put our approach into perspective and explain its limitations of operation, we specify the following threat model and assumptions.

Threat model

The system will identify a network-based attacker who intends to breach the confidentiality, integrity, or availability of the network and the systems that are connected to it. We suppose that the attacker is working by creating malicious traffic flows that can be differentiated from benign traffic. The range of threats that can be detected is clearly matched with the contemporary attack vectors that are reflected in the CICIDS2017 dataset. The actions of the adversary are thus supposed to be categorized as the following:

Denial-of-Service (DoS/DDoS): Attacks that target to overwhelm server or network resources (e.g., Hulk, GoldenEye, Slowloris, Slowhttptest).
Brute-Force: Unauthorized access to services such as FTP and SSH, and this is done through automation.
Web-Based Attacks: Web application-based attacks like Cross-Site Scripting (XSS) and SQL Injection.
Scanning and Probing: Gathering of information to locate open ports and open services.
Infiltration and Botnets: Malware-based operations, which involve the creation of command and control connections and data theft.

We take as the baseline that these malicious actions produce measurable network flow anomalies (e.g., packet sizes, inter-arrival times, flow duration) that are detectable through our feature set.

Key assumptions

The proposed algorithm is evaluated in terms of performance and validity, assuming the following assumptions:

Labeled Data: At the offline training phase, we assume that we have an adequate and correctly labeled corpus of benign and malicious flows of network traffic flows available. This is because the performance of the supervised D-KNN model is determined by the quality of these labels.
Representativeness of Features: It is supposed that the set of engineered features is dense enough to capture the distinguishing features of malicious and normal network activity. We assume that the temporal, flow, and relational characteristics together give the required information to achieve the successful classification.
Environmental Consistency: We suppose that the network on which the model will be deployed has a traffic pattern (e.g., protocols, services) that is statistically close to the CICIDS2017 training environment. Major variations, or concept drift, may possibly impair the accuracy of detection.
Known Threat Patterns: The model will detect attacks that have patterns that are similar to the training set. It is not necessarily aimed at detecting entirely new so-called zero-day attacks that are statistically unrelated to any of the known types of attacks.

Describing traffic flow features

Raw data of the CICIDS2017 were subjected to a necessary preprocessing stage before feature extraction in order to guarantee the quality of data and model stability. Data cleaning was the initial step, and we managed to remove records that had missing (NaN) or infinite values, as this may interfere with the training of machine learning algorithms. All the numerical features in the data set were subjected to the Min–Max normalization method after cleaning. This is a normalization technique that normalizes each feature to a fixed range of [0, + 1], which is crucial to distance-based algorithms such as KNN. It makes sure that those features that have larger numeric ranges do not have a disproportionate impact on the performance of the model. Normalization of all the values X of a feature:

$${X}_{normalized} =\frac{X - {X}_{min}}{{X}_{max} - {X}_{min}}$$

(1)

with ${X}_{min}$ and ${X}_{max}$ represent the minimum and maximum of that attribute for all the records. This cleaned data was subsequently used in the subsequent feature engineering and selection.

The next step in the proposed approach involves describing each traffic flow using a fixed-length feature vector. In the proposed method, a labeled dataset of traffic flows is considered as the training data. To define the feature vector for each instance, statistical features of these instances are extracted in both directions of the flow, i.e., forward (client to server) and backward (server to client). This two-way analysis is important because network communications, particularly the malicious attacks, tend to be asymmetrical in nature. As an example, a DoS attack could have a large amount of requests going forward and little traffic going backward. The perspective of our model will be more detailed as it will calculate the statistical properties of each direction individually and, thus, will be capable of distinguishing between malicious and benign flows better. The proposed method utilizes two statistical groups of features and a set of relational features to describe a traffic flow:

Temporal features These features describe the statistical features of a traffic flow based on specific time points within the flow. These statistical features can define features related to a specific traffic flow instance and are extracted from randomly chosen time points along the connection. In these features, the time point for extracting feature k is determined as a random number with an exponential distribution and a mean of ${2}^{\text{k}}$ (seconds). For example, if k is set to 10, 10 random time points are selected with an exponential distribution and mean values ranging from ${2}^{1}$ (sec) to ${2}^{10}$ (sec). Subsequently, statistical features are extracted from these random time points. We refer to this group as “temporal features.”
Flow features These features are computed based on the overall features of a flow. They can also be useful in detecting network traffic types. Notably, these features do not require the flow to be completed. They represent statistical features up to the current time (unlike time-based features, which depend on specific time points). We refer to this group as “flow features.”
Relational features of the flow This set of features describes technical and communication specifications of the flow, such as source and destination port numbers, control flag counts, data volume-related features, etc. These features are extracted using the CICFlowMeter tool²⁹. In total, 80 features are employed to describe traffic flows.

The set of features used for constructing feature vectors in the proposed method is detailed in Table 2.

Table 2 Statistical features employed in the proposed method.

Full size table

Feature selection based on clustering and MI analysis

The second step in the proposed approach involves selecting an optimal subset of indicators relevant to security threats in network traffic flows. This is achieved by combining clustering techniques and MI analysis. The proposed feature selection strategy includes the following stages:

1.
Clustering of Training Instances In this stage, each indicator serves as a clustering criterion for categorizing the instances. The clustering algorithm used is a D-Kmeans, which adapts it to the computational requirements for large-scale data classification.
2.
Ranking Clustered Indicators After clustering, the importance of each clustered indicator is evaluated using MI to assess its relationship with security threat types.
3.
Final indicator Selection Finally, a forward sequential feature selection strategy is employed to choose indicators relevant to security threats. This strategy explores the problem space and considers the structure of the clustered indicators, aiming to maximize prediction accuracy.

The proposed feature selection process begins with clustering the indicators. Given the large volume of training data, the chosen clustering algorithm must be capable of handling big data. For N datasets, the initial K-Means algorithm with T iterations has a computational complexity of $O(NKT)$, which is not suitable for big data processing applications. Therefore, in the proposed method, a DK-Means model³⁰ is used for indicator clustering. In order to make our results replicable, we specified the parameters of the D-KMeans algorithm. K was empirically determined to be 15. This value was chosen due to the fact that it provided a reasonable clustering of features that are related in regards to functionality and did not result in either too granular or too broad clusters. The algorithm convergence is determined either by a maximum number of iterations, T = 100, or by the difference between the cluster centers of two successive iterations being less than a threshold of 0.001. This clustering algorithm leverages the Map-Reduce model for data categorization. In this computational model, the data is divided into N subsets, and the process of identifying cluster centers within each subset is performed independently. Subsequently, the identified centers for each data subset are merged, and after updating these centers, the final classification is determined. The steps of indicator clustering using the DK-Means algorithm are as follows³¹:

1.
Clustering of indicators In this step, the set of indicators is divided into n subsets.
2.
Random Initialization of K Centers For each data subset, an initial set of K centers is randomly generated.
3.
Computing Distances In each data subset, the distance between each instance and each of the K existing centers is calculated. In this step, the Euclidean distance metric is used.
4.
Data Reassignment Data points are reassigned to the nearest cluster until all data points are processed across all subsets.
5.
Collecting Data Records Data records are collected and processed in the form of data points and K clusters.
6.
Calculating New Centers The newly identified centers for each data subset are merged, and after adjusting these centers, the final classification is determined.
7.
Iterative Refinement The process is repeated from step 3 until the algorithm reaches either the T iteration threshold or the centers’ difference falls below the ϵ threshold in two consecutive iterations.

By executing the above steps, the indicators are placed into K clusters. Indicators within the same cluster exhibit greater similarity compared to indicators in other clusters. Consequently, some combinations of clustered indicators may lead to redundancy and provide repetitive information to the classification model. To address this issue, the proposed feature selection strategy employs MI ranking of indicators. Thus, the MI between each indicator, e.g., X, and the target variable Y is computed using Eq. (2)³²:

$$MI\left(X,Y\right)=\sum_{x\in X}\sum_{y\in Y}p\left(x,y\right)\times log\left[\frac{p\left(x,y\right)}{p\left(x\right)\times p\left(y\right)}\right]$$

(2)

where the probability distribution for x is represented as $p\left(x\right)=\text{Pr}\left\{X=x\right\},x\in X$. Based on this equation, indicators reflect useful information regarding patterns of variations in the target variable through higher MI values. Consequently, by selecting indicators with higher MI values, we can identify a relevant set of indicators. However, it is essential to minimize data redundancy. In other words, the selected indicators should not only provide useful information but also minimize the level of repetitive information available to the classifier. In the proposed method, to filter out indicators with redundancy, we utilize the results of clustering. As mentioned earlier, indicators within each cluster exhibit greater similarity compared to indicators in other clusters, which may increase the probability of redundancy.

Finally, we integrate the MI ranking with the cluster structure in a forward sequential selection method that is iterative to determine the optimal subset of features. It is an adaptive method that does not imply a specific MI threshold, but builds the set of features based on the classification performance. It functions in the following way:

1.
Ranking in Clusters The features in each of the K = 15 clusters are then ranked with the target variable (the type of attack) by MI in descending order.
2.
Iterative Selection and Evaluation The selection starts with the feature set empty. The algorithm takes into consideration the most important, but not chosen, feature of each cluster at every iteration. In order to determine the utility of each of these candidate features, a D-KNN classification model is trained and its performance is measured with the help of a fivefold cross-validation scheme.
3.
Decision and Stopping Criterion The feature of the candidate that causes the greatest reduction in the mean cross-validation error is permanently appended to the best feature set. This is repeated, and the clusters are cycled through so that a diverse and non-redundant set of features is selected. The whole selection procedure ends when none of the further feature additions of any cluster leads to a further decrease (i.e., reduction) of the cross-validation error of the model.

The resulting set of features is not only highly informative (high MI) but also as non-redundant as possible (by sampling features across clusters), which has a direct benefit to the predictive performance and computational performance of the model.

Distributed traffic flow classification based on machine learning

Having found a good subset of features, the last stage of our framework is to classify traffic flows. In this task, we chose the KNN algorithm. Although other models, such as Random Forest (RF) or SVM, can be potent, KNN was selected due to some important reasons. To begin with, its non-parametric character is very beneficial since it does not assume any underlying distribution of the data, which in many cases is complex and unpredictable in network traffic. Secondly, KNN has the reputation of being simple and effective at capturing complex, non-linear patterns in the feature space, which can often result in high classification accuracy.

We admit that the main disadvantage of the standard KNN is that it is computationally inefficient when applied to large datasets. Nevertheless, this challenge is directly alleviated by our proposed methodology in two aspects: (1) the above feature selection step dramatically decreases the dimensionality of the data, and (2) we apply KNN in a distributed manner (D-KNN). We posit that in the case of critical security applications, the promise of better accuracy that KNN can provide when scaled by this distributed architecture is an attractive trade-off.

In order to make our framework scalable and process big data, we used MATLAB R2020a to implement our framework. In the case of the distributed algorithms (D-KMeans and D-KNN), MATLAB was set to communicate with an Apache Spark-enabled cluster directly. By integrating it, we were able to utilize the powerful distributed computing engine provided by Spark, namely, using its Resilient Distributed Datasets (RDDs) to partition our data and its MLlib library to perform the underlying machine learning tasks. The cluster was set up with 1 master node and 4 worker nodes with 8 cores and 16 GB of RAM each, and the data was divided into 100 partitions to make full use of parallel processing. By doing so, the computationally intensive distance calculations, which are in synchronization with MATLAB, can be run in parallel on the cluster, and the total processing time is reduced by several orders of magnitude.

The D-KNN algorithm is modeled on the basis of the chosen features of the training instances and is applied in the process of determining the security threats in the network traffic flows during the prediction step. The algorithm is founded on the MapReduce model suggested in³³ that includes two significant steps:

1.
Map Phase The training data is divided into the worker nodes. In every new test case that should be classified, the case is broadcast among all the nodes. Distances between the test instance and the local subset of training instances are calculated at each node, and each node calculates its local set of K-nearest neighbors.
2.
Reduce Phase All the lists of local K-nearest neighbors of each worker node are gathered by the master node. These lists are then merged by the master node, and lastly, the global top K-nearest neighbors are identified amongst all the candidates. Lastly, the test instance is assigned the final class label by a majority vote of such global neighbors.

Not only does this distributed solution improve the performance of the computations, but it also makes the system more robust since the failure of any worker node does not stop the whole classification process. Details of this classification model are given in³³.

Implementation results

The proposed approach was implemented in MATLAB 2020a, and its performance was evaluated using Cross Validation (CV) with 10 repetitions. In the reviewed results, two different scenarios were evaluated: the “Proposed” scenario, detailed in section three, and the "Prop. (All Features)" scenario, where feature selection by DK-Means and MI is disregarded, and predictions are made based on all features. Additionally, comparisons were made with the articles by Jose et al.¹⁷, Henry et al.¹⁸, Ho et al.¹⁹, and Figueiredo et al.²⁰ discussed in the second section of the paper. When considering attacks as a positive target category, the classification result for each sample will fall into one of the following scenarios:

True Positive (TP): Model correctly classifies the positive cases.

False Negative (FN): The model misclassifies the negative cases.

False Positive (FP): Model incorrectly labels instances as positive.

True Negative (TN): Instances rightly detected as negative by the model.

Accuracy is the ratio of the number of correctly classified instances (T) to the total instances (N). It is calculated using:

$$Accuracy= \frac{T}{N}$$

(3)

Precision is the performance statistic that measures correct positive projections. This metric provides the accuracy of the minority class by dividing the number of correctly projected positive instances by the total of projected positive instances.

$$Precision=\frac{TP}{TP+FP}$$

(4)

Recall means recall is the fraction of correct positive predictions out of the total positive predictions; that is, it differs from precision in including the missed positive predictions. It is computed with:

$$Recall=\frac{TP}{FN+TP}$$

(5)

It should be noted that since the research problem is a multi-class classification problem, the precision and recall metrics were calculated for each target class separately, considering that class as positive, while considering the rest of the classes as negative. Then the average of precision and recall for all classes was calculated. FP and FN should be zero for a successful classifier to have one precision and recall. Thus, we require a precision-recall statistic. F1-Measure considers precision and recall and is defined as:

$$F-Measure=\frac{2*Precision*recall}{Precision+recall}$$

(6)

Figure 2 shows a comparison between the average accuracy of the proposed method and the comparative approaches. The proposed method, with an increase of 1.2% compared to the method of Figueiredo et al.²⁰, was able to increase the accuracy significantly, and also compared to the method of Prop. (All Features), reach an accuracy of 9.2 percent. These results show the ability of the proposed algorithm in identifying security threats and safe transmission of information in the network.

Figure 3 depicts the confusion matrix, with classes displayed in columns and model predictions in rows. The analysis of the confusion matrix indicates that the suggested technique exhibits a higher degree of accuracy in differentiating secure information transmission. In addition, the method employed by Figueiredo et al.²⁰ produces praiseworthy and valuable results. The findings suggest a significant improvement in the secure transmission of information produced by the suggested technique.

Figure 4 shows the precision, recall, and F-Measure criteria between the proposed method and comparative approaches. The class precision metric in secure data transmission designates a ratio of correctly identified data to total data identified by the model. This metric informs us about how much of the data identified by the model as secure fits this identification in the area of information security. Recall in secure data transmission measures the ratio of correctly identified data in relation to all positive data available. On the other hand, the F-Measure metric in secure data transmission is a cumulative one, developed to measure an amalgamation of precision and recall. This metric shows the balance between accuracy and the correct identification of data in security models. As this is a comprehensive, high-level F-Measure metric, it will let us evaluate comprehensively the improvements that go into the detection of security threats, improving overall system performance.

Figure 5 shows the precision measure. According to this figure, the results show that the proposed method has a precision of 93.70% and the comparative method of Figueiredo et al.²⁰ has obtained 88.90% precision. These results show that the proposed method has an effective performance in security and information in data transmission in computer networks.

Figure 6 illustrates the recall metric. From these results, it was obtained that recall for the proposed algorithm was 98.94%, while for the comparative algorithm of Figueiredo et al.²⁰, it amounted to the value of 97.70%. This motivates that a brilliant performance with regard to the recall metric was achieved for the proposed algorithm, and somehow much better in comparison to the metric.

Figure 7 shows the F-Measure. Based on the results, the proposed method with an F-Measure value of 95.99% and a comparative method by Figueiredo et al.²⁰ with an F-Measure value of 92.29% have been obtained. These results show that the proposed method has performed well in F-Measure and can perform well in secure information transmission.

The performance by the proposed method was represented using the Receiver Operating Characteristic (ROC) curve in Fig. 8. The ROC curves were obtained by considering the benign class as negative and other attack-related classes as positive. The efficiency of a model in recognizing attacks can be evident through minimizing the False Positive Rate (FPR) while at the same time maximizing the True Positive Rate (TPR) in the ROC curve. This ROC curve helps in establishing the effectiveness of the model in security measures within the realm of secure data transmission and raising awareness of the security of data transfer in computer networks for exact identification and differentiation between safe and unsafe data transmissions.

Table 3 shows the performance comparison of the proposed method and comparative methods. Besides CICDS2017, in this table, the performance of the models has also been compared with the UNSW-NB15³⁴ dataset. This additional experiment was performed to analyze the generalizability of the proposed model across different network environments and attack types. This dataset covers 9 attacks along with normal network traffic; however, since the distribution of target classes in the UNSW-NB15 dataset has a very high asymmetry (from 2.2 million normal samples in the majority category to 174 samples in the minority category), an attempt has been made to reduce this asymmetry in the testing process by reducing the majority samples.

Table 3 An evaluation of the proposed method and other methods for performance comparison in CICIDS2017 and UNSW-NB15 datasets.

Full size table

Precision, recall, accuracy, F-measure, and area under the ROC curve (AUC) scores are provided in Table 3. As shown in this table, the various performance metrics for the proposed technique are impressive: its precision is 93.71%, its recall rate is 98.95%, its F-Measure is 95.99%, and its accuracy is 98.91% on the CICIDS2017 dataset. Moreover, in the UNSW-NB15 dataset, the suggested model remains very high with an accuracy of 98.51% and the f-measure rate of 94.61%. These results clearly show the salient outperformance of the approach against other methods, such as that by Figueiredo et al.²⁰, which is good but gives less improvement on the comparison criterion in both datasets. The high average results on two modern sets of data ensure the stability and generalizability of the suggested approach.

To obtain such a finding rigorously, we have performed a statistical test to determine whether the increase in performance is statistically significant. A paired t-test was performed between our proposed model and the best-performing baseline, Figueiredo et al.²⁰, based on the outcome of each of the 10 cross-validation folds. The analysis proved that the improvement in performance is not by chance.

The accuracy of the 1.2 percent increase was statistically significant (t(9) = 2.85, p < 0.05). In addition, other significant indicators also rose to a considerable extent: the difference between precision was significant (t(9) = 3.12, p < 0.01), and the F-Measure was also significant (t(9) = 2.98, p < 0.05). Such p-values are less than the standard level of 0.05, which is a significant statistical indicator that our approach can really and truly improve network security situation awareness in a reliable manner. A statistically significant benefit in all metrics is also confirmed by a similar statistical analysis of the proposed model versus the best baseline on the UNSW-NB15 dataset.

Discussion

As indicated in “Implementation results” section, the proposed security situation awareness algorithm is the best one in the experimentation results. The model is much better than the current state-of-the-art methods, with the overall accuracy of 98.91 and AUC of 0.9946 on CICDS2017 and accuracy of 98.51 and AUC of 0.9934 on UNSW-NB15 dataset. Here we will explain these results in detail, comment on their practical implications and scalability of our distributed approach, and place our work in the context of the rest of the network threat detection methodologies.

Efficacy of the hybrid feature selection strategy

The hybrid feature selection method based on D-KMeans clustering and MI analysis is one of the bases of our proposed model. The efficacy of this step is vividly depicted in the comparison of the performance of the entire model (“Proposed”) and the “Prop. (All Features)” as evident in Table 3. Our approach achieved a 98.91% accuracy compared to 89.67% by intelligently narrowing down the feature space- an increase of more than 9.2% on the CICIDS2017 dataset. Similarly, this approach could improve accuracy by 6.39% on UNSW-NB15.

This shows that the bigger feature set is not always better, and, on the contrary, irrelevant and redundant features may cause noise and reduce the performance of the classifier. This has been possible through clustering correlated features using D-KMeans and subsequently selecting the most informative and non-redundant feature within each cluster by a forward sequential method based on MI. It renders the ultimate feature vector compact and very discriminative, which is the primary cause why the model has a high precision (93.71% and 92.17%) and recall (98.95% and 98.41%) on CICIDS2017 and UNSW-NB15 datasets.

Scalability and real-time performance

This subsection provides further analysis on the basis of the processing time, memory consumption, and prediction latency. With our distributed design based on the MapReduce paradigm, it is specifically designed to deal with large volumes of data that are typical of modern network traffic. Subsets of the CICIDS2017 dataset were used to compare with a non-distributed (single-node) implementation of our algorithm. The one-node version overflowed the memory and failed to train on the full data. On the other hand, our distributed approach could scale. As an example, training the entire dataset offline required about 1.5 h on our testbed (a 5-node Spark cluster), which shows its ability to process large-scale data.

Online prediction should be fast so that a security model is feasible. Decision latency is high in a high-throughput environment. We discover that when the D-KNN model has been trained, the average latency to classify a single and new traffic flow is approximately 12 ms. Such fast prediction ability proves that our model is well adapted to be used in near-real-time security monitoring systems where early detection of threats is critical. Although the approach is optimized for big data infrastructure and would not be applicable to low-resource environments, the performance in its target environment justifies the architectural decisions.

Practical deployment: the accuracy versus cost trade-off

Performance to resource cost is another key consideration in any practical deployment. Our model is computationally intensive in the training step, and requires a distributed computing infrastructure like a Spark or Hadoop cluster. We understand that it is a significant infrastructure investment. We, however, contend that in a high-stakes environment like a financial institution, government networks, or critical infrastructure, such investment is worthwhile. The fact that the accuracy is 1.2% higher than the next-best method (Figueiredo et al.²⁰) and that it produces the least false negatives (as the recall is 98.95% and 98.41% on CICIDS2017 and UNSW-NB15, respectively) can be directly translated into avoiding catastrophic security breaches and millions of potential damages and operational integrity.

Additionally, one should distinguish between offline training and online prediction phases. Training is expensive in resources, and can be done periodically (e.g., nightly or weekly) on a back-end cluster. The operational, online model is lightweight in its prediction task, making it able to run efficiently on production systems to analyse live traffic without placing an excessive ongoing computational burden. The two-phase architecture renders the approach powerful and feasible to organizations that have access to big data processing infrastructure.

Supervised learning in the context of emerging threats

The last aspect of the discussion is the selection of supervised learning. Can a model that uses known attacks be effective in an environment of constantly changing threats? This is one of the basic controversies of cybersecurity studies.

The supervised approach was chosen to achieve high detection accuracy and low false positives with a wide variety of known attack types, which pose the vast majority of threats to organizations on a day-to-day basis. The accuracy and F-Measure scores show that our model is highly accurate in this task, and can be trusted to provide reliable alerts that a Security Operations Center (SOC) can act upon and help reduce alert fatigue.

We admit that supervised models, by their nature, cannot recognize new or zero-day attacks that do not resemble patterns in the training data. Supervised or unsupervised learning is therefore not the strongest security posture, but a synergistic combination of the two. Our high-efficacy model is deployed as a first line of defense against well-known, common threats. To achieve full protection, this system must be supplemented with an unsupervised anomaly-based detection engine that will be charged with the responsibility of identifying new and unusual activities. One of the directions of our future research is the creation of such a hybrid, multi-layered detection system.

Conclusion

The paper has presented a high-performance security situation awareness algorithm for network information transmission in big data environments. We propose a novel three-stage framework that presents a new feature selection algorithm, D-KMeans clustering with MI analysis, and uses a distributed KNN classifier to offer scalable and efficient threat detection.

Our approach is confirmed to be effective on the large CICIDS2017 dataset. The model recorded a detection accuracy of more than 98.9% and 98.5% on CICIDS2017 and UNSW-NB15 datasets, which is higher than the comparative methods by at least 1.2%. This shows that the framework has great potential in the development of strong and real-time intrusion detection systems that can manage the volume and complexity of current network traffic, thus improving in-network protection against a broad range of security threats.

Limitations and future work

Although our suggested model shows good performance, we recognize a number of limitations that also open up promising future research avenues.

The main weakness is that the model was tested on CICIDS2017 and UNSW-NB15 datasets. These are big sets of data, but the test of a limited source may limit the generalizability of our findings. Moreover, being a supervised learning model, its major advantage is that it can identify known threats that existed in the training data. Consequently, it automatically becomes limited in its ability to identify new zero-day attacks.

The other major weakness is that the model is dependent on the features that are extracted using unencrypted network traffic. As encryption (e.g., TLS 1.3) becomes more widespread, the efficacy of our current feature set would decrease, since it does not allow for inspecting packet payloads or some flow characteristics.

On the basis of these limitations, we have outlined the following areas of future work:

Generalizability and Robustness: Since it is one of the most important steps to overcome the limited-dataset problem, we will test and improve our framework on other popular network security datasets, such as TON-IoT.
Detection of novel and evolving threats: In order to go beyond detecting known attacks, we suggest a number of extensions. In the future, we will make an attempt to come up with a hybrid model that will combine our supervised D-KNN with an unsupervised anomaly detection sub-module to detect zero-day threats. The model will also include online learning mechanisms so that it can dynamically adapt to new threats, and we will also take into consideration transfer learning, where the knowledge in one network can be used in another network with little retraining.
Encrypted Traffic Analysis: To address the problem of encrypted traffic, we are going to be expanding the list of features. This involves the use of techniques of encrypted traffic analysis, such as extraction of metadata in TLS handshakes, analysis of the Sequence of Packet Lengths and Timings (SPLT), and analysis of the distribution of bytes, none of which requires decryption of the payload.
Other Architectural Improvements: We will consider other architectural improvements in machine learning. This involves the use of deep learning models like the LSTM networks to more effectively recognize time trends in traffic flows. Federated learning will also be considered as a priority, as it would enable privacy-preserving training of models across organizations. Lastly, it will be important to make the model adversarially robust to resist evasion attacks that are specifically crafted to mislead the machine learning classifier.

Data availability

All data generated or analysed during this study are included in this published article.

References

Wu, Y., Tao, J., & He, H. Research on network security situational awareness prediction method based on intelligent learning algorithm. In 2023 Asia-Pacific Conference on Image Processing, Electronics and Computers (IPEC) (pp. 332–335). IEEE (2023).
Caulfield, T., Ioannidis, C., & Pym, D. The US vulnerabilities equities process: An economic perspective. In Decision and Game Theory for Security: 8th International Conference, GameSec 2017, Vienna, Austria, October 23–25, 2017, Proceedings (pp. 131–150). Springer International Publishing (2017).
Chen, L., Fan, G., Guo, K., & Zhao, J. Security situation prediction of network based on lstm neural network. In IFIP international conference on network and parallel computing (pp. 140–144). Cham: Springer International Publishing (2020).
Xosanavongsa, C. Heterogeneous Event Causal Dependency Definition for the Detection and Explanation of Multi-Step Attacks (Doctoral dissertation, CentraleSupélec) (2020).
Zhang, J., Feng, H., Liu, B. & Zhao, D. Survey of technology in network security situation awareness. Sensors 23(5), 2608 (2023).
Article ADS PubMed PubMed Central Google Scholar
Awajan, I., Mohamad, M. & Al-Quran, A. Sentiment analysis technique and neutrosophic set theory for mining and ranking big data from online reviews. IEEE Access 9, 47338–47353 (2021).
Article Google Scholar
Baloch, S., Samsani, S. S. & Muhammad, M. S. Fault protection in microgrid using wavelet multiresolution analysis and data mining. IEEE Access 9, 86382–86391 (2021).
Article Google Scholar
Dong, Z., Su, X., Sun, L., & Xu, K. Network security situation prediction method based on strengthened LSTM neural network. In Journal of Physics: Conference Series (Vol. 1856, No. 1, p. 012056). IOP Publishing (2021).
Li, S., Zhao, D., & Li, Q. A framework for predicting network security situation based on the improved LSTM. EAI Endorsed Trans. Collaborative Comput. 4(13) (2020).
Diao, W. Network security situation forecast model based on neural network algorithm development and verification. In 2021 IEEE 4th International Conference on Automation, Electronics and Electrical Engineering (AUTEEE) (pp. 462–465). IEEE (2021).
Fan, Z., Zhao, P., Jin, B., Tang, Q., Zheng, C., & Li, X. Research on key method of cyber security situation awareness based on ResMLP and LSTM network. IETE J. Res., 1–15 (2023).
Lin, Z., Yu, J. & Liu, S. The prediction of network security situation based on deep learning method. Int. J. Inf. Comput. Secur. 15(4), 386–399 (2021).
Google Scholar
Zhang, L., & Liu, Y. Network security prediction and situational assessment using neural network-based method. J. Cyber Security Mobility, 547–568 (2023).
Chen, J., Li, F., Zou, Z., & Hou, Y. Network security situation awareness framework based on knowledge graph. In 2023 IEEE 7th Information Technology and Mechatronics Engineering Conference (ITOEC) (Vol. 7, pp. 1579–1583). IEEE (2023).
Jiang, R., & Wan, L. Network information security risk assessment method based on machine learning algorithm. In International Conference on Advanced Hybrid Information Processing (pp. 403–416). Cham: Springer Nature Switzerland (2022).
Yun, K., Jin, Y., Huang, Q., & Wang, Q. A network security approach based on machine learning. In 2023 IEEE International Conference on Integrated Circuits and Communication Systems (ICICACS) (pp. 1–5). IEEE (2023).
Jose, J. & Jose, D. V. Deep learning algorithms for intrusion detection systems in internet of things using CIC-IDS 2017 dataset. Int. J. Electrical Comput. Eng. (IJECE) 13(1), 1134–1141 (2023).
Article MathSciNet Google Scholar
Henry, A. et al. Composition of hybrid deep learning model and feature optimization for intrusion detection system. Sensors 23(2), 890 (2023).
Article ADS PubMed PubMed Central Google Scholar
Ho, S., Al Jufout, S., Dajani, K. & Mozumdar, M. A novel intrusion detection model for detecting known and innovative cyberattacks using convolutional neural network. IEEE Open J. Comput. Soc. 2, 14–25 (2021).
Article Google Scholar
Figueiredo, J., Serrão, C. & de Almeida, A. M. Deep learning model transposition for network intrusion detection systems. Electronics 12(2), 293 (2023).
Article Google Scholar
Luo, Y. (2022, October). Analysis and Research of Network Information Security Evaluation Model Based on Machine Learning Algorithm. In 2022 IEEE 4th Eurasia Conference on IOT, Communication and Engineering (ECICE) (pp. 193–196). IEEE.
Wang, Y., Zhao, B., Sun, Z., Huo, Z., Li, X., Wu, Y., & Li, J. A novel network topology sensing method for network security situation awareness. In International Conference on Algorithms and Architectures for Parallel Processing (pp. 82–101). Singapore: Springer Nature Singapore (2023).
Wang, J., Fu, Z., & Wu, Y. N. Information network security situation prediction method based on particle swarm optimization. In 2023 2nd International Conference on Artificial Intelligence and Autonomous Robot Systems (AIARS) (pp. 135–139). IEEE (2023).
Alzubi, O. A. A deep learning-based frechet and dirichlet model for intrusion detection in IWSN. J. Intell. Fuzzy Syst. 42(2), 873–883 (2022).
Google Scholar
Yadawad, R., Kulkarni, U. P. & Alzubi, J. A. Auto-metric graph neural network for attack detection on IoT-based smart environment and secure data transmission using advanced wild horse standard encryption method. Int. J. Comput. Netw. Inf. Secur. 16(3), 1–15 (2024).
Google Scholar
Kabilan, N., Ravi, V. & Sowmya, V. Unsupervised intrusion detection system for in-vehicle communication networks. J. Saf. Sci. Resilience 5(2), 119–129 (2024).
Article Google Scholar
Fernando, G. P., Florina, A. M., & Liliana, C. B. Evaluation of the performance of unsupervised learning algorithms for intrusion detection in unbalanced data environments. IEEE Access (2024).
Sharafaldin, I., Lashkari, A. H. & Ghorbani, A. A. Toward generating a new intrusion detection dataset and intrusion traffic characterization. ICISSp 1, 108–116 (2018).
Google Scholar
CICFlowMeter. Available online at (Accessed on December 27, 2024): (2024) https://github.com/CanadianInstituteForCybersecurity/CICFlowMeter
Alguliyev, R. M., Aliguliyev, R. M. & Sukhostat, L. V. Parallel batch k-means for Big data clustering. Comput. Ind. Eng. 152, 107023 (2021).
Article Google Scholar
Awad, F. H. & Hamad, M. M. Improved k-means clustering algorithm for big data based on distributed smartphoneneural engine processor. Electronics 11(6), 883 (2022).
Article Google Scholar
Zhou, H., Wang, X. & Zhu, R. Feature selection based on mutual information with correlation coefficient. Appl. Intell. 52(5), 5457–5474 (2022).
Article Google Scholar
Maillo, J., Triguero, I., & Herrera, F. A mapreduce-based k-nearest neighbor approach for big data classification. In 2015 IEEE Trustcom/BigDataSE/ISPA (Vol. 2, pp. 167–172). IEEE (2015).
Moustafa, N., & Slay, J. UNSW-NB15: a comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set). In 2015 military communications and information systems conference (MilCIS) (pp. 1–6). IEEE (2015).

Download references

Funding

This work was supported by Talent introduction project fund of Chongqing University of Arts and Sciences. (Research on Enterprise Operation Optimization Strategy Based on Big Data and Artificial Intelligence(R2023RG17).

Author information

Authors and Affiliations

Chongqing University of Arts and Sciences, Yongchuan, Chongqing, 402160, China
Xianping Wang
Hunan University of Medicine, Huaihua, 418000, Hunan, China
Zhiyuan Zhou

Authors

Xianping Wang
View author publications
Search author on:PubMed Google Scholar
Zhiyuan Zhou
View author publications
Search author on:PubMed Google Scholar

Contributions

Xianping Wang and zhiyuan Zhou wrote the main manuscript text. Xianping Wang and zhiyuan Zhou reviewed the manuscript.

Corresponding author

Correspondence to Zhiyuan Zhou.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Wang, X., Zhou, Z. Security situation awareness algorithm of network information transmission based on big data. Sci Rep 15, 39058 (2025). https://doi.org/10.1038/s41598-025-24204-3

Download citation

Received: 02 February 2025
Accepted: 13 October 2025
Published: 07 November 2025
Version of record: 07 November 2025
DOI: https://doi.org/10.1038/s41598-025-24204-3

Subjects

Abstract

Similar content being viewed by others

Leveraging stacking machine learning models and optimization for improved cyberattack detection

Optimized extreme learning machines with deep learning for high-performance network traffic classification

An entropy and machine learning based approach for DDoS attacks detection in software defined networks

Introduction

Related works

Research methodology

Database

Proposed methodology

Threat model

Key assumptions

Describing traffic flow features

Feature selection based on clustering and MI analysis

Distributed traffic flow classification based on machine learning

Implementation results

Discussion

Efficacy of the hybrid feature selection strategy

Scalability and real-time performance

Practical deployment: the accuracy versus cost trade-off

Supervised learning in the context of emerging threats

Conclusion

Limitations and future work

Data availability

References

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing interests

Additional information

Publisher’s note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Quick links