Introduction

Smart contracts (SCs) have been adopted in banking, healthcare, insurance, and the IoT thanks to the rapid development of blockchain technology1. SCs nevertheless pose security risks that stem from their programming model and operating environment. The BeautyChain (BEC) token attack and the Proof of Weak Hand (PoWH) incident illustrate the severity of smart contract (SC) vulnerabilities. Attackers exploited BEC, an Ethereum-based token, through an SC vulnerability, and PoWH, an SC-based Ponzi scheme, suffered from similar weaknesses. In April 2018, an integer overflow in the BEC token SC allowed hackers to crash its market by issuing an excessive number of tokens, and the PoWH contract lost Ether to a similar flaw2,3. These incidents underline the need for integer overflow detection, which is the focus of our method: we detect integer overflow vulnerabilities in Ethereum-based SCs using an accurate and adaptive technique. Gathering SC source code for testing poses difficulties when building a vulnerability detection model. Related research indicates that the source code of only roughly one percent of deployed SCs is publicly visible4, and obtaining a large number of suitable source files is labor- and resource-intensive because of Ethereum network and node limitations.

A comprehensive quality and security check of the dataset is also needed, and obtaining real vulnerability data may raise privacy and legal issues5. Close attention to the quantity and quality of the collected data is therefore essential when building a trustworthy machine-learning model for code representation and vulnerability identification: good samples are necessary for accurate, generalizable models, and a dearth of data limits the model's capacity to identify vulnerabilities. To address these challenges, we present a few-shot learning strategy that discovers SC flaws using data augmentation. Generative Adversarial Networks (GANs) facilitate this identification: a GAN consists of a generator and a discriminator6, where the generator produces synthetic data and the discriminator compares the generated data with real data.

Crucially for data augmentation, we use the GAN generator to continuously construct synthetic contracts that closely resemble real SCs, and we train the discriminator to distinguish real from synthetic contracts; this alleviates the data shortage7. We maintain semantic and syntactic integrity by converting SC source code into vectors using a code embedding technique8. By training on a small set of vectorized samples, the GAN can produce many synthetic samples against which similarities can be compared. Our approach combines vector similarity analysis with GAN discriminator feedback to detect SC integer overflow problems. The model uses the GAN's adversarial training to extract the important characteristics of SCs, making SC security analysis more accurate and efficient9.

Research gap

SC vulnerability detection is critical to ensure security and trustworthiness10. Traditional methods, including fuzzing, symbolic execution, and formal verification, have automation, efficiency, and accuracy limitations. Recent efforts have focused on analyzing SC source code, but issues remain in preserving code structure, managing diverse information, and reducing dependence on large datasets11. Furthermore, current feature-learning approaches struggle with effective vulnerability prediction12. Addressing these shortcomings is critical to improving SC security.

Motivation

Blockchain technology, particularly SC, has revolutionized automation in various industries but faces significant security challenges13. Studies reveal inconsistencies in vulnerability detection tools, leading to high false positives and missed vulnerabilities. Traditional manual detection is inefficient, while machine learning offers a promising alternative. However, current models struggle with SC preprocessing, losing essential syntax and semantics14. Researchers are exploring vectorization and graph-based techniques, with Graph Neural Networks (GNNs) showing the potential to capture contract features. Limited access to high-quality datasets due to privacy and legal concerns remains challenging. Researchers are exploring data augmentation and few-shot learning to enhance SC vulnerability detection.

Research contributions

This article introduces a method for detecting vulnerabilities in SCs that combines GAN with code embedding. We train a GAN model on integer overflow vulnerabilities by transforming SC source codes into vector representations using code2vec. The GAN discriminator detects vulnerabilities and performs vector similarity analysis, while the GAN generator expands the dataset. Unlike traditional methods, this approach preserves contract properties through Abstract Syntax Tree (AST) vectorization and enables deep learning on limited data. Dual similarity detection accuracy is enhanced using GAN feedback, cosine similarity, and correlation coefficients. Our major contributions are as follows:

  • We developed a code embedding and GAN-based vulnerability detection methodology for integer overflow vulnerabilities.

  • AST-based representation retains contract properties.

  • We use GAN data augmentation to enable deep learning on small samples.

  • We conducted scalability, accuracy, and efficiency tests on 150 public Ethereum contracts.

Research background

Smart contracts

In 1994, computer scientist and cryptographer Szabo coined the term “smart contract” to describe digital agreements15. A SC is a programming-code-based agreement that automatically fulfills its obligations once all parties meet the prerequisites. Developers encode the business logic as program code and store it in a blockchain system, which activates the SC’s functions. Nakamoto16 proposed Bitcoin in 2008 together with the underlying blockchain concept, and the first blockchain system launched in 2009. In his description, a blockchain is a distributed, peer-to-peer (P2P) ledger that cannot be altered. The Ethereum white paper17 expanded the use of blockchain technology beyond currency by adding SCs to the platform. Ethereum, currently the most widely used blockchain platform, was the first to support SCs5; SCs thus enable programmatically controlled blockchain data in ways that transcend mere financial transactions. Ethereum is an important blockchain tool that lets users create SCs to control blockchain data and offer digital money18, which greatly expands the scope of blockchain applications and facilitates diverse uses. Once all parties have signed a contract, it is encoded as a piece of program code and recorded on the blockchain19,20. Predetermined states, transition methods, and conditions trigger the execution of these agreements; after the conditions are met, the SC is enabled in the blockchain network and checked by nodes before its operations are performed. The blockchain monitors SC execution to guarantee that contracts are carried out exactly as specified when their triggering conditions are met21. SCs have four characteristics:

  1. They are transaction-driven, with no need for human involvement.

  2. An SC cannot be stopped once it has started.

  3. Since most blockchain nodes must verify the validity of an SC, every single node is aware of it.

  4. They are adaptable to the various settings in which they are deployed.

These characteristics ensure the safety of investors. The management sector for SCs has also grown19. Because they can be traced, SCs are an excellent choice for electoral voting22. Among the many potential uses of this technology are digital asset copyright and the administration of corporate processes23. SCs are used to determine access permissions in electronic medical records13; this gives medical professionals increased control over patient information and helps prevent data leaks. The combination of the Internet of Things (IoT) and decentralized SCs is another significant development: SCs simplify complicated IoT network procedures and increase resource sharing, improving the industry’s productivity, information security, and application costs24.

Smart contract vulnerabilities

Attackers can use SC vulnerabilities to modify program data, disrupt execution, or perform unauthorized activities14. These vulnerabilities allow resource theft, identity manipulation, and data compromise25. Ethereum, the most popular SC blockchain, has suffered the most vulnerabilities and losses, making it a research priority. Table 1 describes various smart contract vulnerabilities. SC vulnerabilities occur at three levels.

Table 1 Various smart contract vulnerabilities.

These vulnerabilities must be addressed to secure SCs and prevent financial losses.

Securing SCs

The rising use of Ethereum has increased its monetary value, highlighting the cryptocurrency’s security risks12. To reduce the risk of system instability, Ethereum’s original design included decentralized P2P networks and consensus algorithms, virtual machine technology to run SCs in a safe sandbox, and cryptographic methods to encrypt and verify data; Ethereum’s prototype contained all of these capabilities26. SCs were developed to automate contract execution, remove the need for trusted third parties, and enhance transaction security and efficiency. Researchers have focused considerable attention on SC security because of the widespread use of these contracts in industries including finance, supply chain management, and the IoT.

The intractable nature of blockchain networks and the intricate programming languages used to create SCs are the root causes of SC security issues27. Ethereum SCs offer Turing-complete programming but also come with additional security risks. A reentrancy attack on the DAO project in 2016 stole around $50 million, causing substantial economic damage and exposing SC vulnerabilities in design and execution. The Parity Wallet incident further highlighted the harm caused by SC programming issues: due to design flaws and careless repairs, the Parity multi-sig wallet contract was attacked in 2017, with $30 million in Ether stolen or frozen6. These incidents show that SC development requires extensive security testing and regular monitoring. In a 2018 cyberattack, 52.3 million NEM tokens worth 534 million USD were stolen from Coincheck28: a security flaw in the exchange’s hot wallet administration let the attacker move large sums through unapproved transactions. The Coincheck incident demonstrated how critical secure methods of storing and protecting assets are, particularly when using SCs10.

Since illegal financial transactions can result from inappropriate use or vulnerabilities in SCs, protecting exchange assets is a top priority11. SCs also raise cross-platform security issues, and the broad application of blockchain technology makes SC security a matter of global relevance. In August 2021, over USD 610 million was stolen from Poly Network, an interoperable cross-chain protocol29. The attacker changed the Keeper role in EthCrossChainData.sol and controlled asset movement by exploiting an SC weakness to create a cross-chain transaction. This incident highlighted the security risks in the design and implementation of cross-chain protocols, which are particularly problematic in complex cross-chain communication and contract interactions7. These key historical events, which shaped blockchain technology and security testing, form the foundation of SC security: SCs must be verified and maintained for the blockchain ecosystem to stay secure and grow.

Generative adversarial network (GAN)

A GAN combines two neural networks to create data resembling its training data. GANs are used for image generation, image editing, and text generation. The generator creates data intended to be indistinguishable from real data, while the discriminator distinguishes between the two; this is described as a zero-sum game. Training the two networks together improves the realism of the generator’s output30. GANs can realistically render faces, objects, and scenes; edit photographs to remove unwanted elements or change a person’s appearance; and have generated poems, essays, and code. Because they generate new data, GANs are used in many applications17 and will likely become increasingly popular as they mature. In SC vulnerability detection, GANs serve three roles: data enhancement, anomaly detection, and vulnerability localization24. GANs can greatly enhance SC vulnerability detection, but research and development are needed to overcome the remaining obstacles and maximize their potential.

Methodology framework

Preparing the source code, training the model, and identifying similarities are the three steps of the GAN-based process for finding integer overflow vulnerabilities in SCs. Figure 1a shows the source code preparation steps: integer overflow vulnerability features are used to separate relevant sections from non-essential ones before Abstract Syntax Tree (AST) generation16. An AST is a tree representation of the syntactic structure of code. Code2vec11 is a neural network-based technique that takes ASTs as input and generates vector embeddings, in this case contract vectors. Preprocessing ensures data consistency across all SCs.

The GAN architecture consists of a generator and a discriminator, each implemented with three fully connected layers. The generator uses layers of 128, 256, and 128 nodes, while the discriminator uses layers of 256, 128, and 64 nodes. ReLU activation functions are applied to all hidden layers, and a sigmoid function is used at the output layer of the discriminator. The models were trained using the Adam optimizer with a learning rate of 0.0002, a batch size of 32, and 200 training epochs.
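The layer configuration above can be sketched as a forward pass in numpy; the embedding dimensionality, noise dimensionality, and weight initialization are assumptions for illustration (the paper does not specify them), and the Adam training loop is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_mlp(sizes):
    """Random weights for a fully connected stack (illustrative initialization)."""
    return [(rng.normal(0, 0.02, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(layers, x, out_sigmoid=False):
    """ReLU on hidden layers; optional sigmoid on the output layer."""
    for i, (W, b) in enumerate(layers):
        x = x @ W + b
        last = i == len(layers) - 1
        x = sigmoid(x) if (last and out_sigmoid) else (x if last else relu(x))
    return x

EMBED_DIM = 64          # assumed contract-vector dimensionality
NOISE_DIM = 32          # assumed noise dimensionality

# Generator: noise -> 128 -> 256 -> 128 -> contract vector
generator = init_mlp([NOISE_DIM, 128, 256, 128, EMBED_DIM])
# Discriminator: contract vector -> 256 -> 128 -> 64 -> real/fake probability
discriminator = init_mlp([EMBED_DIM, 256, 128, 64, 1])

noise = rng.normal(size=(32, NOISE_DIM))      # batch size 32, as in the text
fake = forward(generator, noise)
score = forward(discriminator, fake, out_sigmoid=True)
```

The sigmoid output keeps each discriminator score in (0, 1), matching its role as a real-versus-synthetic probability.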

Abstract Syntax Tree (AST) paths

AST paths are structured associations between nodes of a program’s syntax tree. They capture syntactic and structural interdependence by connecting code parts through nodes. These paths describe code logic and flow, helping machine learning models extract features.

Code2vec

Code2vec is a neural network-based code embedding model that converts source code into vectors. It generates distributed embeddings from AST paths to capture code semantics and syntax. These embeddings map code snippets into a continuous vector space for pattern recognition, enabling vulnerability identification, code summarization, and categorization.

To train the GAN model, the generator and discriminator compete on the processed training set: the generator creates synthetic contracts that look increasingly realistic, and the discriminator separates actual contracts from fake ones. Once training converges, the generator approximates the target data distribution and produces high-quality synthetic samples, which are used to enlarge the test set. Identifying an integer overflow problem in a SC starts with vectorizing its source code. The trained discriminator receives the vector and produces a security label. When the label is positive, we compare the contract vector with the expanded test set, and the detection mechanism determines the contract’s susceptibility using a similarity threshold coefficient: the contract is at risk when the similarity coefficient exceeds the cutoff. The detection process, in summary:

  1. Preprocessing the source code: preprocess the source code and build vulnerability-specific contract vectors.

  2. Model training: train the GAN generator and discriminator. The discriminator separates actual and synthetic contracts, and the generator generates high-quality synthetic contracts.

  3. Finding security holes: use the discriminator to label the target contract’s vector. If the label is positive, compute the vector similarity and apply the similarity threshold coefficient to check for the vulnerability.
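The three steps above can be sketched end to end; `embed`-style vectorization is assumed to have already happened, and `discriminator_label`, the mean-cosine rule, and the 0.85 default are stand-ins for illustration rather than the paper's exact implementation:

```python
import numpy as np

def detect(contract_vec, discriminator_label, augmented_set, threshold=0.85):
    """Step 3: flag a contract only if the discriminator label is positive
    AND its similarity to known-vulnerable vectors exceeds the threshold."""
    if not discriminator_label(contract_vec):     # step 2's trained model
        return False
    # mean cosine similarity against the GAN-augmented vulnerable set
    sims = [np.dot(contract_vec, v) / (np.linalg.norm(contract_vec) * np.linalg.norm(v))
            for v in augmented_set]
    return float(np.mean(sims)) > threshold

# toy usage: a vector nearly identical to the vulnerable set is flagged
vuln = [np.array([1.0, 0.0, 1.0]), np.array([0.9, 0.1, 1.0])]
target = np.array([1.0, 0.05, 1.0])
flagged = detect(target, lambda v: True, vuln)
```

Requiring both signals to agree is what the text calls dual detection: a positive discriminator label alone is not enough to flag a contract.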

Data augmentation with GAN

The generation of susceptible contract data involves code preprocessing, code embedding, and code generation, as shown in Fig. 1b.

Fig. 1
figure 1

Source code for generating an Abstract Syntax Tree (AST).

Code preprocessing

Preprocessing ensures that the data for AST-based analysis is clean and standardized: comments are removed, variable names are standardized, and whitespace is normalized to keep the code consistent. SC source code may include private user and transaction information, and improper processing during model training could breach data protection regulations. Solidity, the SC programming language, allows customized identifiers, and naming conventions and coding styles differ across programs and developers. For GAN modeling and similarity judgment, vectorization of the source code must translate semantically identical code segments into comparable vectors, which is why source code preprocessing is required. Preprocessing rules:

  • Maintain integer overflow vulnerability features.

  • Maintain code semantics and structure.

  • Code embedding input specification.

  • Standardize identifier naming and code formatting.

The preprocessing logic is illustrated in the following pseudocode:

figure a
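The pseudocode figure is not reproduced here; a minimal sketch of the stated rules (comment removal, whitespace normalization, and uniform identifier renaming) might look like the following, where the regexes and the `VAR1, VAR2, ...` naming scheme are assumptions for illustration:

```python
import re

def preprocess(solidity_src: str) -> str:
    """Normalize a Solidity snippet for embedding: strip comments,
    rename user identifiers uniformly, and collapse whitespace."""
    src = re.sub(r"//[^\n]*", "", solidity_src)       # line comments
    src = re.sub(r"/\*.*?\*/", "", src, flags=re.S)   # block comments
    # rename declared integer variables to VAR1, VAR2, ... (toy rule)
    names = {}
    for m in re.finditer(r"\buint\d*\s+(\w+)", src):
        names.setdefault(m.group(1), f"VAR{len(names) + 1}")
    for old, new in names.items():
        src = re.sub(rf"\b{old}\b", new, src)
    return re.sub(r"\s+", " ", src).strip()           # collapse whitespace

code = """
contract C {
    uint8 counter; // user balance
    function bump() public { counter = counter + 1; }
}
"""
normalized = preprocess(code)
```

Renaming identifiers before embedding is what lets two semantically identical contracts with different naming styles map to comparable vectors, per the rules above.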

Description of code features

Establishing the characteristics of integer overflow is crucial. Integer overflow arises from the integer types of Solidity, the SC programming language. In the EVM, integers are fixed-size unsigned data types whose ranges are defined by their bit width: Solidity supports unsigned integers from uint8 (8-bit) up to uint256, where uint256 denotes a 256-bit unsigned integer. Adding 1 to a uint8 variable that stores 255 overflows, and the result becomes 0, as shown in Fig. 2. Figure 3 provides sample Solidity code for creating smart contracts.
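The uint8 wrap-around described above can be reproduced by emulating the EVM's fixed-width modular arithmetic in Python (a simulation of the semantics, not Solidity itself):

```python
def uint_add(a: int, b: int, bits: int = 8) -> int:
    """Fixed-width unsigned addition as the EVM performs it:
    the result wraps modulo 2**bits instead of raising an error."""
    mask = (1 << bits) - 1                 # 255 for uint8
    return (a + b) & mask

overflowed = uint_add(255, 1)              # uint8: 255 + 1 wraps to 0
big = uint_add(2**256 - 1, 1, bits=256)    # uint256 wraps the same way
```

This silent wrap-around, rather than an error, is exactly what made the BEC and PoWH exploits possible.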

Fig. 2
figure 2

Fundamental concepts of integer overflow with examples.

Fig. 3
figure 3

A smart contract’s AST.

Embed input details

The AST produced by the Solidity parser ANTLR has to be handled in line with the code2vec embedding criteria. Specifically, the traversable AST must fit the following definition, which uses the components Non-Terminal nodes (NT), Terminal Nodes (TN), value set (A), Root node (R), non-terminal map (MNT), and terminal map (MTN). The AST of an SC can be depicted as \(\langle NT, TN, A, R, MNT, MTN \rangle\), where \(NT\) represents non-terminal nodes and \(TN\) represents terminal nodes. The set \(A\) contains the values of the TNs, and \(R \in NT\) is the AST root node.

The function \(MNT: NT \rightarrow (NT \cup TN)^{*}\) maps non-terminal nodes to their respective child nodes, and the function \(MTN: TN \rightarrow A\) associates terminal nodes with values. Every node except the root appears exactly once among the child-node lists. Figure 4 shows the AST of a smart contract.

An AST path is a directed sequence of nodes that represents a syntactic relationship between two terminal elements in the AST of a program. It captures the structure and direction of traversal (upward or downward) between nodes and characterizes the relationships between code tokens. These paths form the building blocks for constructing code semantics in later stages.

AST Paths: An AST path of length \(L\) is a sequence

$$\begin{aligned} p = x_1 m_1 \cdot x_2 m_2 \cdots x_L m_L \cdot x_{L+1}, \end{aligned}$$

where \(x_1, x_{L+1} \in TN\) (terminal nodes) and \(x_j \in NT\) for \(j \in [2..L]\) (non-terminal nodes).

The movement direction in the AST is represented by \(m_j \in \{ \uparrow , \downarrow \}\): \(\uparrow\) (up) denotes that \(x_j\) is a child of \(x_{j+1}\) (movement toward the root), and \(\downarrow\) (down) denotes that \(x_{j+1}\) is a child of \(x_j\) (movement away from the root).

The starting and ending nodes of a path \(p_T\) are \(S(p_T)\) and \(E(p_T)\). The path context for an AST path \(p_T\) is the triplet:

$$\begin{aligned} \langle a_S, p_T, a_T \rangle \end{aligned}$$
Fig. 4
figure 4

Smart contract’s abstract syntax tree.

$$\begin{aligned} a_S = MTN(S(p_T)) \end{aligned}$$
(1)
$$\begin{aligned} a_T = MTN(E(p_T)) \end{aligned}$$
(2)

Here \(a_S \in A\) is the value of the path’s starting terminal node and \(a_T \in A\) is the value of its ending terminal node, as given by the terminal map \(MTN\).
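Under the definitions above, path contexts can be extracted from a toy AST with a short traversal. The tuple-based node representation, the `↑`/`↓` markers, and the example tree for `x = x + 1` are all illustrative assumptions:

```python
# Toy AST: each node is (label, [children]); leaf labels double as token values.
def leaves_with_paths(node, prefix=()):
    """Yield (terminal value, tuple of ancestor labels root-first)."""
    label, children = node
    if not children:
        yield label, prefix
    for child in children:
        yield from leaves_with_paths(child, prefix + (label,))

def path_contexts(root):
    """All <a_S, p, a_T> triplets: the path climbs (up) from the source
    terminal to the common ancestor, then descends (down) to the target."""
    leaves = list(leaves_with_paths(root))
    contexts = []
    for i in range(len(leaves)):
        for j in range(i + 1, len(leaves)):
            (a_s, up), (a_t, down) = leaves[i], leaves[j]
            k = 0                                  # longest common ancestor prefix
            while k < min(len(up), len(down)) and up[k] == down[k]:
                k += 1
            path = tuple([f"{n}↑" for n in reversed(up[k:])]
                         + [up[k - 1]]
                         + [f"{n}↓" for n in down[k:]])
            contexts.append((a_s, path, a_t))
    return contexts

# AST for `x = x + 1`
ast = ("Assign", [("Name", []), ("BinOp", [("Name", []), ("Num", [])])])
ctxs = path_contexts(ast)
```

Each triplet pairs two terminal values with the non-terminal route between them, which is exactly the input shape code2vec consumes.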

Code embedding

code2vec is a neural embedding technique that converts source code into continuous vector representations by learning from syntactic paths (AST paths) and their contexts. It enables learning about code structure and semantics in a way suitable for machine learning applications such as vulnerability detection.

Once the semantic analysis is complete, we use code2vec to create vector representations for training. Specifically, code2vec captures the links between code elements by extracting path and context properties from the AST. Paths represent syntactic structures such as function calls and variable assignments by connecting two nodes with directed edges; context properties give code elements their roles and location details. Code embedding uses the following Eq. (3):

$$\begin{aligned} CE = \sum _{j=1}^{n} AW_j VR_j \end{aligned}$$
(3)
$$\begin{aligned} VR_j = NNF(S_j, E_j, Node_j) \end{aligned}$$
(4)

CE represents the final code embedding vector for a given smart contract, and n is the total number of AST paths extracted from the source code.

Fig. 5
figure 5

Illustration of code embedding and synthetic code generation process.

Each \(PT_j\) is the \(j\)th path context in the AST and is transformed into a path vector \(VR_j\) using a neural network function (NNF), Eq. (4). This function takes three inputs: \(S_j\) (the vector representation of the starting token), \(E_j\) (the vector of the ending token), and \(Node_j\) (the sequence of node types along the path). The output \(VR_j\) captures the semantic and structural features of the code path.

\(AW_j\) is the attention weight assigned to the jth path. It reflects the importance or relevance of that path in the overall context of the code. Paths that contribute more to the code’s functional meaning are given higher weights during aggregation.

The final embedding vector CE is thus a weighted sum of all the individual path vectors, where each path’s influence is modulated by its attention weight. This method captures both local and global code semantics and enables the detection of subtle patterns related to vulnerabilities.
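Equations (3)–(4) amount to an attention-weighted sum of path vectors. A numpy sketch follows, where the path vectors stand in for NNF outputs and the softmax-over-dot-products attention is an assumption (code2vec-style) rather than the paper's stated formula:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def code_embedding(path_vectors, attention_vector):
    """Eq. (3): CE = sum_j AW_j * VR_j, with attention weights AW_j
    derived from each path vector's affinity to a learned attention vector."""
    VR = np.asarray(path_vectors)          # (n_paths, dim), NNF outputs VR_j
    AW = softmax(VR @ attention_vector)    # one scalar weight per path
    return AW @ VR                         # weighted sum -> final vector CE

rng = np.random.default_rng(1)
VR = rng.normal(size=(5, 8))               # 5 path vectors of dimension 8
attn = rng.normal(size=8)
CE = code_embedding(VR, attn)
```

Because the weights sum to 1, CE stays in the span of the path vectors while emphasizing the paths most relevant to the contract's behavior.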

Figure 5a gives an overview of code embedding: code2vec concatenates path and context information into vector representations that encode code element semantics. Code2vec uses this approach to create vectors for functions, variables, operators, and other code structures, and vectorizes the code segment after integrating all code element vectors. The AST was parsed using solidity-parser-antlr version 0.4.13. Vector embeddings were generated with code2vec as per Alon et al. (2019), using the implementation at https://github.com/tech-srl/code2vec, which we adapted to support Solidity syntax.

Code generation

Figure 5b shows the synthetic code generation process, in which the GAN generates Solidity code vectors from the vector dataset. The generator produces synthetic code vectors from random noise, while the discriminator differentiates actual vectors from synthetic ones. Through iterative training, the generator produces vectors that mimic real Solidity code vectors, while the discriminator improves its ability to separate them. The generator loss function is LF\(_g\) (Eq. 5) and the discriminator loss function is LF\(_d\) (Eq. 6), with generator \(g\), discriminator \(d\), real sample \(r\), random noise \(n\), and distributions \(Dis_{\text {data}}(r)\) and \(Dis(n)\).

$$\begin{aligned} \mathscr{L}\mathscr{F}_g = \mathbb {E}_{n \sim Dis(n)} [\log (1 - d(g(n)))] \end{aligned}$$
(5)
$$\begin{aligned} \mathscr{L}\mathscr{F}_d = \mathbb {E}_{r \sim Dis_{\text {data}}(r)} [\log d(r)] + \mathbb {E}_{n \sim Dis(n)} [\log (1 - d(g(n)))] \end{aligned}$$
(6)

GAN training stops when the generator and discriminator reach a Nash equilibrium. After training, the generator can produce realistic-looking Solidity code vectors, which serve as synthetic contract vectors. Using this method, we augment the vulnerable dataset with numerous synthetic contract vectors, and vector similarity detection then uses the updated vulnerable contract dataset.
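The loss terms in Eqs. (5)–(6) can be evaluated directly from discriminator outputs as batch estimates of the expectations; the numeric scores below are made up for illustration:

```python
import numpy as np

def generator_loss(d_fake):
    """Eq. (5): E[log(1 - d(g(n)))] over a batch of discriminator
    scores on generated samples; the generator minimizes this."""
    return float(np.mean(np.log(1.0 - d_fake)))

def discriminator_loss(d_real, d_fake):
    """Eq. (6): E[log d(r)] + E[log(1 - d(g(n)))]; the discriminator
    maximizes this, scoring real samples high and fake ones low."""
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

# a confident discriminator: real scores near 1, fake scores near 0
d_real = np.array([0.90, 0.95, 0.99])
d_fake = np.array([0.05, 0.10, 0.02])
lf_d = discriminator_loss(d_real, d_fake)   # near 0, its maximum
lf_g = generator_loss(d_fake)               # the generator drives this toward -inf
```

At the Nash equilibrium the discriminator outputs 0.5 everywhere, giving LF\(_d\) = 2 log(0.5), which is the usual stopping signature for this minimax game.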

Dual similarity detection

Discriminator GAN analysis

During GAN training, only vectors that have integer overflow vulnerabilities are used. This means the trained discriminator can tell the difference between actual and fake contracts and identify those with integer overflow vulnerabilities.

Vector similarity analysis

Vector similarity analysis is a fundamental criterion for automated detection. Contract vectors that include the structural and semantic information of the source code are produced using code2vec.

$$\begin{aligned} \cos (a, b) = \frac{1}{n} \sum _{i=1}^{n} \frac{a \cdot b_i}{\Vert a\Vert \Vert b_i\Vert } \end{aligned}$$
(7)
$$\begin{aligned} CC = \frac{1}{r} \sum _{k=1}^{r} \frac{\sum _{l=1}^{c} (a_l - \bar{a})(b_{kl} - \bar{b_k})}{\sqrt{\sum _{l=1}^{c} (a_l - \bar{a})^2 \sum _{l=1}^{c} (b_{kl} - \bar{b_k})^2}} \end{aligned}$$
(8)

Here \(a\) is the target contract vector and \(b\) is the vulnerable contract vector set; \(b_k\) is the \(k\)th vector in \(b\), while \(r\) and \(c\) are the size of \(b\) and the dimensionality of \(a\). The values \(\bar{a}\) and \(\bar{b}_k\) are vector means, \(\cos (a, b)\) is the average cosine similarity (Eq. 7), and \(CC\) is the average correlation coefficient (Eq. 8). The target contract is likely vulnerable if both the Pearson correlation coefficient and the cosine similarity are high. To make detection more precise, we take a weighted average of the cosine similarity and the correlation coefficient and apply a threshold to decide whether the target contract is vulnerable to integer overflow (Fig. 6).
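The two averaged similarity measures of Eqs. (7)–(8) have a direct numpy implementation; the example vectors are illustrative only:

```python
import numpy as np

def mean_cosine(a, B):
    """Eq. (7): average cosine similarity between target vector a
    and each row b_k of the vulnerable-contract matrix B."""
    num = B @ a
    den = np.linalg.norm(a) * np.linalg.norm(B, axis=1)
    return float(np.mean(num / den))

def mean_pearson(a, B):
    """Eq. (8): average Pearson correlation coefficient between a
    and each row of B."""
    return float(np.mean([np.corrcoef(a, b)[0, 1] for b in B]))

a = np.array([1.0, 2.0, 3.0, 4.0])
B = np.array([[1.0, 2.0, 3.0, 4.0],     # identical   -> cos = corr = 1
              [2.0, 4.0, 6.0, 8.0]])    # scaled copy -> cos = corr = 1
```

Both measures are scale-invariant, which is why a rescaled copy of the target scores a perfect 1 under each; the Pearson term additionally centers the vectors, making it insensitive to constant offsets as well.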

Fig. 6
figure 6

Similarity detection method.

Experimental results and analysis

We have described the procedure for enhancing the vulnerable contract dataset with the GAN model and the methodology for converting SC source code into vectors that capture structural and semantic attributes, and we have shown how SC integer overflow vulnerabilities are evaluated using the GAN discriminator in conjunction with vector similarity. This section walks through the proposed method for finding SC integer overflow vulnerabilities step by step. Before the experiments, we present the experimental setup and dataset, establish the appropriate vector similarity cutoff coefficient, and compare our results with other tools.

Experimental design

The experiments used a Windows 10 PC with an Intel Core CPU (2.30 GHz), 16 GB RAM, and a GeForce RTX 2060. Code2vec (2020 release) derives feature vectors from SC source code, whereas solidity-parser-antlr (version 0.4.11) produces abstract syntax trees (ASTs). We ran two experiments to see whether the proposed technique can find integer overflow issues in SC code. The method averages the cosine and Pearson correlation coefficients to determine vector similarity; we tested numerous weights and thresholds to discover the best parameters for finding SC source code vulnerabilities, assessing recall and accuracy. We built training and testing subsets from our core dataset (enhanced-smart-contracts-dataset.CSV) using open-source SCs with security classifications, which enabled us to assess the effectiveness of vulnerability detection.

Test data and evaluation criteria

This section details the steps required to test the vulnerability detection approach, including collecting data, selecting assessment metrics, and establishing an experimental comparison environment. Two hundred Etherscan contracts were used to test our integer overflow vulnerability detection approach on SC source code. Etherscan connects Ethereum nodes to analytics and block explorers; these SCs can be checked for security properties, Solidity source code, and contract address. Table 2 summarizes the source dataset. Fifty SCs with integer overflow flaws were incorporated into the training set for the GAN models and vector similarity investigations. The testing set contains 150 SCs, of which 80 are secure and 70 have integer overflow issues; it is used to compare our detection method with others.

Table 2 Source dataset summary.
Table 3 Enhanced dataset summary.

Table 3 offers an enhanced dataset summary. Using the trained GAN model and the 50 genuine contracts in the training set, we produced 1,950 synthetic contracts for the vector similarity identification dataset.

Criteria for assessment: after generating the dataset, we detected SC vulnerabilities using sFuzz31 and Oyente25. The following considerations directed the selection of these two tools:

  • The tools’ source code is publicly available.

  • Many vulnerability detection programs use these tools as performance baselines.

  • The tools detect the vulnerability we target.

These two tools helped us identify areas of weakness in the test set. Table 4 displays the detection results from each tool’s performance testing using confusion matrices; we then compared the detection data to show the benefits and effectiveness of the proposed vulnerability detection technique. The confusion matrix records true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), and we use it to assess how well the recognition model performs. Several efficiency metrics follow from it, with Eqs. (9)–(13) specifying Accuracy (ACC), Precision, Recall, F1-Score, and Overfitting Rate (OR), respectively.

Table 4 A summary of detection results of the enhanced data.

We thus test the detection model using Accuracy (ACC):

$$\begin{aligned} {\textbf {ACC}}= & \frac{TP + TN}{FP + FN + TP + TN} \end{aligned}$$
(9)
$$\begin{aligned} {\textbf {Precision}}= & \frac{TP}{TP + FP} \end{aligned}$$
(10)
$$\begin{aligned} {\textbf {Recall}}= & \frac{TP}{TP + FN} \end{aligned}$$
(11)
$$\begin{aligned} {\textbf {F1-Score }}= & 2 \times \frac{\text {Precision} \times \text {Recall}}{\text {Precision} + \text {Recall}} \end{aligned}$$
(12)
$$\begin{aligned} {\textbf {Overfitting Rate (OR)}}= & \frac{\text {Training Accuracy} - \text {Test Accuracy}}{\text {Training Accuracy}} \end{aligned}$$
(13)
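Equations (9)–(13) can be computed directly from the confusion matrix counts; the counts below are made up for illustration and are not the paper's measured values:

```python
def metrics(tp, tn, fp, fn, train_acc=None):
    """Confusion-matrix metrics per Eqs. (9)-(13)."""
    acc = (tp + tn) / (tp + tn + fp + fn)                 # Eq. (9)
    precision = tp / (tp + fp)                            # Eq. (10)
    recall = tp / (tp + fn)                               # Eq. (11)
    f1 = 2 * precision * recall / (precision + recall)    # Eq. (12)
    result = {"ACC": acc, "Precision": precision, "Recall": recall, "F1": f1}
    if train_acc is not None:
        result["OR"] = (train_acc - acc) / train_acc      # Eq. (13)
    return result

# illustrative counts for a 150-contract test set (70 vulnerable, 80 secure)
m = metrics(tp=66, tn=72, fp=8, fn=4, train_acc=0.95)
```

The overfitting rate (Eq. 13) is the relative drop from training accuracy to test accuracy, so values near zero indicate the model generalizes rather than memorizes.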

To provide a clear comparative benchmark, we evaluated our approach against sFuzz and Oyente using the same test set of 150 contracts. Our GAN-based method achieved an overall detection accuracy improvement of 12.4% over sFuzz and 18.1% over Oyente. Furthermore, it demonstrated higher F1-score and precision values, confirming its superior balance of sensitivity and specificity in detecting integer overflow vulnerabilities.

To validate the specific contribution of GAN-generated synthetic data to model performance, we conducted an ablation study by training the system without synthetic contracts. In this setup, the F1-score dropped from 0.91 to 0.84, and accuracy declined by 9.7%, confirming the critical role of GAN-based data augmentation in addressing the data scarcity challenge. These results empirically support the claim that the proposed method benefits from the synthetic vector generation process.

Vector similarity parameter test

This section tests the composite vector similarity detection settings used to identify SCs, applying the best parameters for detecting integer overflow vulnerabilities to improve the proposed strategy. We test how modifying the vector similarity threshold and the cosine similarity weight affects detection performance. Equation (14), defined before the tests, determines the vector similarity result:

$$\begin{aligned} S = \cos (x,y) \cdot W + r \cdot (1 - W) \end{aligned}$$
(14)

Here S denotes the vector similarity result, cos(x, y) is the cosine similarity between the vectors x and y, r is the second similarity component (weighted by 1 − W), and W is the cosine similarity weight, a real value between 0 and 1. If S exceeds the threshold T, where T is also a real value between 0 and 1, the target contract is flagged as containing an integer overflow vulnerability; if S is below T, the target contract is judged free of integer overflow issues. We determine the cosine similarity weight W experimentally: with the threshold fixed at 0.85, we vary the weight and measure its impact on model accuracy, as demonstrated in Fig. 7. The experiments show that the model achieves its highest accuracy when W = 0.74.
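A minimal sketch of this decision rule, assuming the two contracts have already been embedded as vectors and that the second similarity component r is supplied by an earlier stage (the vectors and r values below are placeholders):

```python
import math

def cosine(x, y):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def composite_similarity(x, y, r, w):
    """Eq. (14): convex combination of cosine similarity and the component r."""
    return cosine(x, y) * w + r * (1 - w)

def is_vulnerable(x, y, r, w=0.74, t=0.9):
    """Flag the target contract when the composite similarity S exceeds threshold T."""
    return composite_similarity(x, y, r, w) > t

# Placeholder embeddings: identical vectors give cosine similarity 1.0.
print(is_vulnerable([1, 2, 3], [1, 2, 3], r=0.95))  # True  (S = 0.987 > 0.9)
print(is_vulnerable([1, 2, 3], [1, 2, 3], r=0.20))  # False (S = 0.792 < 0.9)
```

Note that the convex combination keeps S in [0, 1] when both components lie in that range, which is consistent with threshold settings such as T = 0.9.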

To offer a more reliable assessment beyond accuracy, we computed precision, recall, and F1-score for each threshold and weight configuration. The model achieved an F1-score of 0.91 when W = 0.74 and T = 0.9, indicating strong balance between sensitivity and specificity. Precision was 0.89 and recall was 0.94 in this optimal configuration.

Semantics and code structure: cosine similarity captures both well. Integer overflow vulnerabilities tend to share similar semantics and code structure, so computing the cosine similarity between code embeddings helps locate them.

Resilience: cosine similarity is robust to outliers and noise. In practical applications, code may be modified superficially through comments, whitespace, and similar changes; cosine similarity masks these differences to some extent, making the model more robust.

To evaluate statistical significance, we performed repeated trials (n = 10) for each parameter setting. A paired t-test on detection accuracy across weight values showed statistically significant differences (p < 0.05), confirming that the chosen configuration (W = 0.74) improves detection performance in a meaningful way.
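For reference, the paired t statistic underlying such a test can be computed as below; the per-trial accuracy values are fabricated placeholders, and with n = 10 trials the statistic is compared against the two-tailed critical value t(0.025, df = 9) ≈ 2.262:

```python
import math

def paired_t_statistic(a, b):
    """Paired t-test statistic for two equal-length samples of per-trial accuracies."""
    assert len(a) == len(b)
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance of differences
    return mean / math.sqrt(var / n)

# Placeholder accuracies for two weight settings over n = 10 repeated trials.
acc_w074 = [0.91, 0.92, 0.90, 0.93, 0.91, 0.92, 0.90, 0.91, 0.93, 0.92]
acc_w050 = [0.88, 0.90, 0.87, 0.89, 0.88, 0.90, 0.86, 0.88, 0.91, 0.89]
t = paired_t_statistic(acc_w074, acc_w050)
print(abs(t) > 2.262)  # True -> difference significant at p < 0.05 (df = 9)
```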

Fig. 7 Cosine similarity weight experiments.

Fig. 8 Vector similarity threshold experiments.

After determining W, we identify the threshold T through further tests. The model's sensitivity to vector similarity depends on T: raising the threshold demands a higher degree of vector similarity before a contract is flagged. A low threshold tends to increase false positives, while a high one increases false negatives; there is no universally ideal threshold value. Beyond affecting model complexity, a threshold set too high or too low degrades model performance.

We also examined the potential risk of overfitting in the GAN-generated synthetic contracts. Since these contracts are derived from a small training set, there is a chance that the generator could produce overly similar instances, reducing generalizability. To mitigate this, we injected noise variability into the generator’s latent space and applied dropout regularization in the discriminator during training. In future work, we plan to adopt adversarial validation techniques and external datasets to further test the robustness of the model against synthetic overfitting bias.
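These two mitigations can be sketched as follows (a numpy illustration of the general techniques, not the paper's training code; the noise scale and dropout rate are assumed values):

```python
import numpy as np

rng = np.random.default_rng(42)

def perturb_latent(z, noise_scale=0.1):
    """Inject Gaussian noise into latent vectors so the generator sees more
    varied inputs, discouraging near-duplicate synthetic contracts."""
    return z + rng.normal(0.0, noise_scale, size=z.shape)

def dropout(activations, rate=0.5):
    """Inverted dropout on discriminator activations during training:
    randomly zero units and rescale the survivors."""
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)

z = rng.normal(size=(4, 16))   # a batch of latent vectors
z_noisy = perturb_latent(z)    # varied generator inputs
h = rng.normal(size=(4, 32))   # stand-in discriminator hidden activations
h_drop = dropout(h)            # regularized activations
```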

Figure 8 displays the concluding experimental findings. Calibrating the model threshold to 0.9 yielded strong detection accuracy and generalizability, effectively balancing the FP and FN rates. Our methodology converts SCs into compact vector representations via a code embedding method while preserving the necessary structural and semantic information, which improves the efficiency and effectiveness of vulnerability identification.

Conclusion

This research has demonstrated and empirically validated a novel method for locating integer overflow vulnerabilities in SCs, accomplished through the combination of code embedding and GANs. The proposed approach addresses the major problem of data scarcity in SC security research by using GANs to generate synthetic contract vector data that preserves the structural and semantic properties of real-world contracts.

This indicates that the technique can help address the challenge of data scarcity in smart contract vulnerability detection. By combining discriminator feedback with vector similarity analysis, the proposed approach can uncover vulnerabilities even with limited training data. While the results demonstrate promising accuracy, further validation using additional tools such as Mythril and Slither, as well as metrics like precision, recall, and F1-score, will be necessary to comprehensively assess and benchmark the method’s performance.

Compared to baseline tools, our method improves detection accuracy by 12.4% over sFuzz and 18.1% over Oyente. These gains are accompanied by stronger F1-score and precision values, indicating more balanced performance. To validate the specific contribution of GAN-generated synthetic data, we performed an ablation study comparing detection results with and without synthetic vectors. The inclusion of synthetic data improved the F1-score from 0.84 to 0.91, demonstrating the effectiveness of GANs in mitigating data scarcity.

Before SCs are deployed, this method offers a valuable alternative for improving SC security and lowering the risk of significant financial losses.