An evolutionary computation-based sensitive pattern hiding model under a multi-threshold constraint in healthcare

Sharma, Shivani; Sharma, Rohan; Kumar, Sachin; Min, Hong

doi:10.1038/s41598-025-03346-4

Download PDF

Article
Open access
Published: 08 October 2025

An evolutionary computation-based sensitive pattern hiding model under a multi-threshold constraint in healthcare

Shivani Sharma¹,
Rohan Sharma¹,
Sachin Kumar² &
…
Hong Min³

Scientific Reports volume 15, Article number: 35062 (2025) Cite this article

167 Accesses
Metrics details

Subjects

Abstract

In the domain of collaborative frequent pattern mining, the preservation of privacy has emerged as a critical area of investigation as the data procured by the analytical arm of business organizations may contain critical sensitive patterns. Further, this investigation may cause the disclosure of sensitive information. Various Evolutionary techniques have been proposed in the past to efficiently investigate such sensitive patterns while preserving data privacy. These techniques utilized various nature-inspired evolutionary-based algorithms like Particle Swarm Optimization (PSO) and Ant Colony Optimization (ACO) for masking such confidential information before sharing data to the business organizations. However, most of them either choose to delete entire sensitive transactions for masking confidential information or by selecting a victim item and its subsequent deletion based on a single parameter such as the length or frequency of a sensitive item. This may cause various side effects, and hence reduces the utility of sanitized datasets. In this paper, we propose a novel and innovative Particle Swarm Optimization (PSO) based algorithm specifically designed to address the challenge of concealing sensitive patterns within the constraints of a multi-threshold framework. The proposed algorithm emphasizes enhancing the utility of sanitized datasets by selecting victim items based on multiple parameters, unlike the schemes proposed in the past. For making the scheme suitable for real applications, we propose to introduce the dynamic multithreshold-based framework, the algorithm utilizes the bi-variate normal distribution to determine the dynamic threshold value for each sensitive item set. This helps in improving the utility of the underlying dataset while making lesser modifications for hiding sensitive knowledge. The empirical findings substantiate the effectiveness of the proposed algorithm in concealing sensitive patterns over benchmark FIMI datasets, Heart Disease, and Heart Attack Prediction datasets. The experimental results show the superiority of the proposed scheme by reducing the side-effect with minimum loss of data over the existing PSO and ACO-based algorithms. We noted that the Failure-to-Hide(FTH) named side effect is significantly lower in most of the cases for our algorithm in comparison to the existing algorithms.

Application of information theoretic feature selection and machine learning methods for the development of genetic risk prediction models

Article Open access 02 December 2021

Exploring the cytotoxic effects of bioactive compounds from Alcea rosea against stem cell driven colon carcinogenesis

Article Open access 18 February 2025

Cooperative metaheuristic algorithm for global optimization and engineering problems inspired by heterosis theory

Article Open access 21 November 2024

Introduction

The immense increase in the volume of data facilitates us to analyze and understand business-related associations and patterns in the datasets. However, the analysis requires huge computation and storage capabilities which most of the data owners lack and avail third-party facilities. Further, collaborative pattern mining is another dimension that data owners like to explore to understand global patterns. Data sharing expands the possibility of high-utility pattern mining but also invites several privacy threats. The sharing of data may expose highly confidential information of an owner to others that she is not willing to share.

Privacy-Preserving Data Mining (PPDM) techniques mask sensitive patterns before sharing or analyzing data with a third party. These techniques are broadly divided into two categories i.e. knowledge-hiding techniques (KHT) such as pattern hiding and data masking, and Data Hiding Techniques(DHT) such as anonymization and encryption.

Frequent itemset mining lies under the domain of KHT where we mask sensitive information by decreasing the support count below some pre-defined threshold ($\delta _{min}$) by either deleting the item, replacing it with some foreign item, or deleting the entire transaction supporting sensitive information from the dataset. Therefore, the selection of optimum victim items as well as transactions for hiding the data is of utmost importance. The masking of some instances of information from the dataset affects the utility of the data. Maintaining an optimal privacy and utility ratio is an NP-hard problem¹. Many methods have been developed in the past for achieving different aims like simplicity, high speed, optimal solution, etc. The heuristic schemes provide high-speed and simple solutions but they possess a lot of side effects and low data utility. However, the schemes following exact approaches such as linear/integer programming provide an optimal solution but they involve huge computation and time complexity. Evolutionary-based schemes provide a mid-way path and generate a near-optimal solution with relatively low computation and time complexity.

Evolutionary-based schemes have been recently explored especially for Privacy Preserving Data Mining (PPDM). Many researchers are focusing on evolution-based techniques for masking confidential patterns. Lin et al. proposed Genetic algorithm-based techniques in² and Authors in³ proposed a Particle swarm optimization-based transaction deletion scheme for masking sensitive patterns. Further, Ant Colony-based schemes have been proposed in⁴ and⁵ to minimize the side effects caused by transaction deletion during data sanitization.

All of these algorithms are based on the deletion of transactions and incur a high loss of data. Further, they may contribute to hiding failure in the case of dense datasets which are defined as follows.

Definition 1

(Dense Dataset): The Density factor of a dataset D, $dens-fact(D)$, is the fraction of the number of $1^{s}$ in the transaction-item matrix of $D \text {(TID)}$ to the size of this matrix.

$$\begin{aligned} dens-fact(D)= \frac{\sum _{i=1}^{\mid D \mid }\sum _{j=1}^{\mid l \mid } \text {TID}[i,j]}{\mid \text {TID}\mid } \end{aligned}$$

(1)

where $\text {TID}$ is a transaction item matrix or binary form of a dataset D as shown in Tables 1 and 2. The presence and absence of any item are denoted by binary numbers 1 and 0 respectively. The support count of an item can also be determined by calculating the total number of $1^s$ in all the transactions for that item.

Table 1 Transactional dataset.

Full size table

Table 2 Binary transactional data.

Full size table

If support of a sensitive pattern is reduced concerning n transactions by deleting the transactions having that pattern may become frequent for reduced m number of transactions where $n > m$. Suppose ABC is an itemset and 5 is its support count, represented as (ABC,5). The minimum threshold value is $\delta _{min} =40\%$ with a total number of transactions as 10 i.e. $n=10$. ABC would be infrequent if we delete 2 transactions having ABC, then m would be 8, and support of ABC would be 3. If we again mine the dataset with the remaining 8 transactions keeping the same threshold of 40% the pattern (ABC, 3) becomes a frequent itemset and leads to hiding failure.

Furthermore, the hiding of all instances of sensitive frequent itemsets with a unique minimum threshold ($\delta _{min}$) is not an appropriate criterion for all of the possible scenarios. The reason of such an assertion is that, an sensitive frequent itemset $s_i$ is associated with certain attributes like length $len(s_i)$ and sensitivity $sens(s_i)$ values, which may affect its probability of mining in a given dataset. For example, an itemset whose frequency is high in a given dataset has a higher probability of getting mined than one with a few items. The degree of sensitivity is another important factor that is to be considered when deciding on privacy requirements. For example, an item whose sensitivity value is high, would contain highly sensitive attributes such that they must have a tight threshold value rather than an itemset with a lower sensitivity value. In⁴, the authors proposed the concept of multi-threshold using ant colony optimization for improving the privacy of different length-sensitive patterns. However, this dimension of choosing multi-threshold-based constraints in an evolutionary-based scheme should be more explored for diversified solutions. The key contributions of the proposed algorithm are as follows.

1.
We propose a novel sensitive pattern-hiding approach using a dynamic multi-dimensional threshold constraint determined with the help of two attributes i.e. length and the sensitivity values of the near-optimally selected victim items. It should be noted that most of the existing techniques consider only a single attribute, and they usually delete entire sensitive transactions which causes higher side effects and lower utility of sanitized datasets.
2.
A dynamic multi-dimensional threshold-based function denoted as $\delta _{dm}$ (as mentioned in Eq. 7) is proposed which is being determined using a bi-variate normal distribution function F(x, y). It determines a dynamic threshold value for each sensitive pattern-based on two crucial factors i.e. length and sensitivity values.
3.
We introduce the utility based privacy parameter denoted as $\mathcal{P}\mathcal{U}$ as shown in Eq. (11). This makes the proposed approach flexible by providing an opportunity for an end user to decide privacy and utility ratio according to the application and hence it makes the approach more practical and useful for different type of datasets.
4.
Unlike most of the existing schemes, the proposed approach tends to mask sensitive patterns according to their contribution towards pre-determined utility and privacy.
5.
The proposed scheme tends to improve the exploring capability of the proposed scheme using updated velocity and position equations and hence, promises a near-optimal solution for masking confidential information.
6.
In the proposed scheme, we are picking a different number of transactions for each victim item such that this number is determined with the help of a dynamic multi-threshold function.

The novelty of the proposed work lies in its ability to explore a broader range of dimensions and factors to identify an optimal set of transactions and victim items based on multiple, dynamically determined thresholds. These threshold values are evaluated through a proposed threshold function that leverages the bi-variate normal distribution, enhancing the model’s alignment with real-world applications and datasets. Unlike previous approaches, which often rely on fixed threshold values and consider a limited number of influencing factors, the proposed scheme accommodates multiple variables that can impact the data distortion process. This results in reduced side effects and improved performance. Additionally, the framework offers flexibility in balancing privacy and utility. The parameters can be tuned according to specific application requirements, allowing the scheme to adapt its behavior dynamically based on the adjusted values.

Motivation

In today’s era of data where most of our decisions are based on the analytical review of data patterns, we are also on the verge of a privacy breach. Therefore, it is important to conceal sensitive or confidential information before sharing our data. While considering confidential information hidden within datasets like healthcare, retail data, finance data, etc., it is important to explore Privacy Preserving Data Mining (PPDM) techniques. These techniques would be able to minimize the disclosure of sensitive information while understanding important associations and gaining knowledge at the global level.

PPDM schemes aim to distort sensitive frequent itemsets using a random process such as perturbation, addition, subtraction, replacement, etc. The sanitization is not an optimal solution as it induces side effects, affecting the utility and privacy of datasets like the generation of artificial patterns, hiding failure, missing cost, etc. The selection of the most relevant part of the dataset for mitigating these side effects is the primary goal of the sanitization process.

Most of the evolutionary computation-based schemes⁶ and⁷ have been developed to generate an optimal set of transactions or items that could be perturbed by masking sensitive patterns. All of these models used a unique minimum threshold value to hide sensitive information which may not be ideal for real-life scenarios. For example, in healthcare data, some sensitive diseases like HIV should have a tighter threshold than the symptoms of simple flu. So, the measurement of such diseases against a single threshold is not an ideal approach. Hence, we need a comprehensive solution for addressing the need for privacy, utility, and the dynamicity of the minimum support threshold.

Contribution

This paper addresses the problem of optimized victim item deletion-based sensitive pattern hiding using multi-dimensional dynamic threshold criteria. This approach contributes in the following directions:

Sensitive itemset categorization The sensitive itemsets are categorized based on the sensitivity of the itemsets. The attributes belonging to sensitive itemsets are sorted in terms of sensitive information possessed by them. If a sensitive itemset has Identifier attributes (defined in Definition 2) that directly reveal any sensitive information they are termed as privacy-demanding sensitive itemsets. These sensitive itemsets have higher priority and must be masked with a tight threshold. Further, a sensitive item with indirect attributes / quasi-identifiers (defined in Definition 3) possessing low sensitivity is considered a utility-possessing attribute. These attributes are treated with a loose threshold and used to improve the utility of the dataset.

Definition 2

(Direct Attributes):Direct attributes are keyed attributes which contain information that uniquely and directly identifies individuals such as full name and social security number. These are also called ‘Identifiers’.

Definition 3

(Indirect Attributes): The attributes combined with external data, lead to the indirect identification of an individual. These attributes are non-unique data such as gender, age, and postal code. These are also called as ‘Quasi-Identifiers’.

Dynamic multi-threshold calculation In place of a unique threshold value, the proposed scheme follows a multi-threshold constraint. Privacy-demanding and utility-possessing sensitive itemsets are further sorted individually in terms of their length and sensitivity.
Non-sensitive border (NSB) In the case of dense datasets, where the number of frequent patterns would be larger and hence we have to analyze a large number of itemsets to determine the impact of any modification on non-sensitive information. It is computationally expensive to handle such a large number of non-sensitive itemsets. We propose to find a border for these nonsensitive patterns. These border elements are representative of other non-sensitive itemsets and their count is low in the dataset. The analysis of such bordered non-sensitive itemsets reduces the complexity of determining the impact of any modification done by the proposed algorithm and it would help in improving the utility of the dataset.
High coverage victim item selection The proposed method considers the deletion of the optimum victim item rather than deleting the entire transaction supporting a sensitive pattern. Unlike traditional schemes, we propose to identify victim items with minimal side effects on non-sensitive borders and maximum impact on sensitive patterns. This further helps in enhancing the utility of sanitized datasets by reducing data loss.
Evolutionary transaction exploration An optimized set of transactions is explored using a Particle Swarm Optimization (PSO) based evolutionary algorithm for modification in the original dataset. The designed model is shown to be more effective in optimizing the required aim of hiding sensitive information with minimal side effects.

The experiments show that the designed PSO-based multi-threshold constraint-based sanitization model outperforms previous evolutionary-based models by providing high utility with the desired privacy constraint.

Further, the paper is organized as follows. In section "Related work", a literature review of evolutionary-based privacy-preserving data mining techniques is discussed. Section "Preliminaries and problem statement" defines the preliminaries and problem statement. In section "Proposed sensitivity and length based multi threshold victim deletion based datasanitization scheme, the proposed approach is discussed in detail. Section "Experimental results" presents the experimental setup and results and finally, in section "Conclusion" the conclusion is drawn.

Table 3 represents the basic abbreviations used in this work for better reading and understanding of the article.

Table 3 Important abbreviations used in proposed work.

Full size table

Related work

Recently many research contributions have been made by researchers for hiding sensitive knowledge in different domains such as privacy protection problems in rare pattern mining⁸, where network vulnerability detection and abnormal medical data, is undertaken. The authors have proposed two algorithms named LT-MIN and LT-Max to hide sensitive rare itemsets and the side effects on the original database. In⁹, the authors propose three heuristic-based algorithms, namely Selecting the Most Real item sensitive utility First (SMRF), Selecting the Least Real item sensitive utility First (SLRF), and Selecting the most Desirable Item First (SDIF), to efficiently conceal all Sensitive High utility Itemsets (SHIs) while mitigating the projected detrimental impact on the non-sensitive information. Further, in¹⁰, the authors propose a parallel evolutionary Privacy-Preserving Data Mining (PPDM), called High-performance Evolutionary Data Sanitization for IoT (HEDS4IoT), and implement two mechanisms on a Graphics Processing Units (GPU)-aided parallelized platform to achieve real-time streaming protected data transmission. The first mechanism, the Parallel Indexing Engine (PIE), generates retrieval index lists from the dataset using GPU blocks. Further, in¹¹, the Enhanced Elephant Herding Optimization Algorithm for Association Rule Hiding (EEHOA4ARH) is proposed for association rule hiding. In EEHOA, two core functions as clan updating operator and separating operator are used for association rule hiding that also realizes the fast convergence and exploitation capabilities. To reduce the time consumption for the selection of the best solution, a Crowding Distance (CD) concept is combined with EEHOA4ARH. By continuously updating the best elephant and replacing the worst elephant in the population, EEHOA4ARH-CD sanitizes the transaction database effectively. Certain other methodologies^12,13 which have been proposed for other objectives like Semantic-aware dehazing network for feature fusion and Multi-Scale Three-Path Network^14,15 for segmentation are been utilized in other research areas and can be explored as the solution for sensitive data segmentation and fusing important features for data privacy preservation.

All of the approaches discussed show a positive contribution to hiding sensitive information in several domains. It can be depicted that hiding sensitive knowledge is one of the crucial points of concern in today’s era of data. Most of these recent schemes are using different strategy to hide sensitive knowledge with the objective of reduced side-effect, computation time, etc. However, the sensitive pattern hiding scheme needs more attention with multiple objectives of reduced side-effects, computation as well as improved utility.

Here we present a review of some evolutionary-based important algorithms that contributed to PPDM. The evolutionary-based schemes use nature-inspired approaches like Genetic Algorithm (GA)¹⁶, Particle Swarm Optimization (PSO)¹⁷, Ant Colony Optimization (ACO)¹⁸, etc. which provide a near-optimal solution by iteratively exploring search spaces using several strategies⁶. In¹⁹, authors proposed a GA-based protocol for mining rules at third parties, for the data owners owning their private data while dealing with any privacy breach. All the rules are evaluated using a proposed joint fitness function. Lin et al. proposed various algorithms like sGA2DT², pGA2DT², and cpGA2DT³ using GA for masking confidential patterns by removing supporting transactions. A fitness function is proposed to determine the effect of any alteration in terms of three side effects. In^19,7 the authors studied extensions of evolutionary approaches. All of these approaches help in exploring the optimal transactional set for deletion, but, the weights for each side effect need to be pre-determined which greatly affects the resultant sanitized dataset. To overcome this effect, EMO-RH²⁰ approach is proposed in which, missing information is produced and sensitive itemsets are deleted from the explored transactions. In¹⁷ the authors have explored particle swarm optimization (PSO) for having the desired sanitized dataset. Initially, Bonam et al. studied PSO for rule hiding and proposed RSIF-PSOW²¹ which distorts the binary dataset and turns the frequency of sensitive itemset from 1 to 0. Further, another algorithm called PSO2DT⁷ is proposed by Lin et al. to hide sensitive patterns by removing entire sensitive transactions. The set of transactions is explored using PSO while N particles are initiated representing the candidate solution. In every iteration, the velocity and position of a particle are updated to improve the exploration capability. It is being observed that the use of PSO improved the overall computation as in GA-based approaches continuous tuning of parameters for mutation and crossover is required which is not required when working with PSO-based methodology. Furthermore, another clustering-based scheme was proposed in²². The author proposed selecting a set of transactions with a minimum effect on non-sensitive but a maximum on sensitive information. Clusters of sensitive patterns are explored using hierarchical clustering and Jaccard similarity measures.

Apart from PSO, Ant colony optimization (ACO) is another nature-inspired optimization technique that is explored for the same objective. Wu et al. proposed a scheme named ACO2DT⁵ for minimizing the side-effect of deleting the records. A graph is created where each node represents sensitive transactions. In every iteration, an ant takes a tour and evaluates the set of transactions that need to be deleted. A heuristic function has been proposed for selecting the next set of transactions. A cuckoo optimization scheme has been used by Afshari et al.²³ and is called as COA4ARH algorithm. The algorithm deletes sensitive itemsets from the supporting transactions. The side effects of any distortion are evaluated using three different fitness functions. Another evolutionary scheme named Social Group Optimization (SGO)²⁴ is used for hiding sensitive patterns from selected transactions. Items with maximum impact on sensitive items are selected as victim items and those items which are having the highest frequency are deleted from the set of transactions.

All the techniques that we have discussed above use single-objective optimization approaches. These schemes do not provide an optimal solution because of the orthogonal relationship between the side effects. Cheng et al.²⁵ propose a multi-objective evolutionary scheme combining knowledge and data distortion for hiding sensitive rules. However, the quality of the sanitized dataset is not appropriate for some applications. Another GA-based multi-objective scheme was introduced by Lin et al.²⁶ named NSGA2DT. The approach uses database dissimilarity as a fourth factor for evaluating the chromosomes’ performance. VIDPSO is proposed²⁷ which uses PSO to determine a near-optimal set of transactions. Here a victim item with a minimal effect on non-sensitive itemset based on its participation in confidential information is selected and further deleted from the supporting transactions. A density Clustering based approach called CMPSO⁶ and a grid-based multi-objective optimization scheme (GMPSO)²⁸ are another two dimensions explored for the aim of data sanitization. Although there are several benchmark contributions by existing evolutionary PPDM techniques, there are several challenges which are as follows:

Most of the EB-PPDM schemes tend to delete the entire sensitive transaction from the dataset. This causes lower utility as a large portion of the dataset gets removed from the original datasets.
The existing approaches hide the sensitive patterns against a static minimum threshold. This may not be an appropriate choice for some applications like healthcare as other factors like length, sensitivity may also affect the probability of any pattern getting mined.
The EB-PPDM-based approaches aim to delete some information until all of the sensitive information gets masked. However, there is a tradeoff between utility and privacy. Only a few approaches attempted to maintain this ratio as most of them tend to hide all sensitive knowledge and achieve complete privacy without taking care of the utility of the resultant sanitized dataset.

All of the literature discussed above is categorized and presented in Figure 1.

Preliminaries and problem statement

In this section, we briefly define the basic definition of the hidden sensitive information in terms of market basket-dense datasets.

Preliminaries

Let $\mathcal {D}=\{T_1, T_2,\ldots ,T_d\}$ be the original dataset having d transactions. Let $\mathcal {A} = \{ A_1, A_2, ... , A_m\}$ be the set of attributes. These attributes can be classified into three categories which are as identifiers, quasi-identifiers, and normal attributes. The attributes which are directly revealing some sensitive information are termed identifiers like ‘ID’. Quasi-identifiers imply that they do not reveal any sensitive information directly but can be aggregated or linked with other attributes to reveal some sensitive information. The attributes which are not participating in any sensitive information are termed as normal attributes.

An Itemset I is a collection of attributes $I =\{A_i A_j ... A_k\}$ having some interesting information. A minimum threshold ($\delta _{min}$) function is assigned by the user initially to mine the high support patterns in dataset $\mathcal {D}$. Further, a dynamic threshold function ($\delta _{dm}$) presented in Eq. (7) is evaluated to generate dynamic threshold value for itemsets with varying lengths and sensitivity values. The minimum threshold $\delta _{min}$ is considered to avoid the generation of any low support pattern as a frequent pattern in the dataset. For example, consider a dataset D having 10 transactions as shown in Table 4. The frequent patterns discovered from D with minimum support threshold $\delta _{min} = 40\%$ are shown in Table 5.

Table 4 Original dataset.

Full size table

Table 5 k-frequent Patterns at $\delta _{min}=40\%$ where k represents the length of a pattern.

Full size table

Definition 4

(Frequent Itemset): $F= \{f_1, f_2,\ldots ,f_k\}$ is the set of frequent itemsets where for each $f_i \in F,$ and $sup(f_i) \ge \mid D\mid \times \, \delta _{min}$ is the minimum support count. $sup(f_i)$ is the support count of $f_i$, that is total number of transactions having itemset $f_i$.

Definition 5

(Sensitive Pattern): An itemset I is a sensitive pattern. if it is frequent and exhibits some sensitive information that should be masked before data sharing.

Definition 6

(Sanitized Dataset): A Dataset $D' \subseteq D$ is called a sanitized dataset if it does not contain any sensitive information i.e. for any sensitive itemset $s \in S$ and $\; s_i \notin D'$. We obtain a sanitized dataset $D'$ by deleting some patterns or records from the original dataset D.

$$\begin{aligned} D'= D- S \end{aligned}$$

(2)

Privacy-preserving data mining techniques tend to mask sensitive patterns by deleting some information i.e. sensitive patterns from the original dataset. The deletion of information causes certain side-effects such as misses cost, artificial patterns, and utility loss. The details of each of them are listed below:

Definition 7

(FTH): Failure to hide side effects holds when there exist some instances of sensitive patterns ($s \in S$) in sanitized dataset $D'$. The degree of FTH is denoted by $\alpha$.

$$\begin{aligned} \alpha = S \cap F \end{aligned}$$

(3)

Definition 8

(NTH): Not to hide side effect implies that a set of non-sensitive frequent patterns i.e. $NS \subseteq F$ get masked accidentally after the sanitization process. It is denoted by $\beta$

$$\begin{aligned} \beta = F - S - NS \end{aligned}$$

(4)

Definition 9

(NTG) Not to generate side effect implies when during sanitization process, some artificial patterns get generated while mining $D'$ and they are originally not present in Dataset D. It is denoted by $\gamma$.

$$\begin{aligned} \gamma = F' - F \end{aligned}$$

(5)

The overall illustration of relationships between the original dataset D, sanitized dataset $D'$, and the side effects are shown in Figure 2.

To hide sensitive patterns, we need to mitigate the discussed side effects. Before deleting any information. We assess its impact on the resultant dataset which is a computationally intensive task. We would be able to achieve this task easily with the assessment of such modifications on the border itemsets.

Definition 10

(Non-sensitive Positive Border($Bd^{+}(NSB)$) Itemset): Let NS be a collection of non-sensitive frequent itemsets and P the lattice of itemsets. The border $Bd^{+}(NSB)$ of NS consists of all non sensitive frequent itemsets in NS such that

$$\begin{aligned} Bd^{+}(NSB)= \{ns_x \in NSB \mid \forall ns_y \in P \;with \; ns_x \prec ns_y \;then \;ns_y \notin NSB \end{aligned}$$

(6)

For example, consider a set of non-sensitive frequent itemsets $NS=\{a,b,c,e, ab, bc, be, ce, bce \}$ generated from the Table 4. The positive border of the non-sensitive frequent itemset i.e. $Bd^{+}(NSB)=\{bce,ab\}$. Note that further we will be using NSB as $Bd^{+}(NSB)$.

Definition 11

(Safe State): Suppose a $j^{th}$ instance of non-sensitive itemset $ns_j$ in NSB with support value of x i.e. $sup(ns_j)=x$. Then $ns_j$ is considered to be in a safe state if and only if, after deletion of an item i belonging to $ns_j$ from the supporting transactions T, $ns_j$ remains frequent in the resultant dataset. This ensures that the item is not fully dependent on the considered set of transactions and can still appear frequently in the produced sanitized dataset i.e.

$$ns_j \; is \; \text{ safe }\; =\; \{\text {iff} \,currentsup(ns_j) = (originalsup(ns_j) - \Delta {\text {diff}}) > \delta _{min}$$

wher, $\Delta {\text{ diff }}$ is the number of times an item is supposed to be deleted for masking. $currentsup(ns_j)$ is the support count of $ns_j$ after sanitization and $originalsup(ns_j)$ is the support of $ns_j$ before sanitization.

Problem statement

In this work, we propose the following steps for sanitizing data with improved utility.

Step 1:
Categorize the sensitive itemsets for two attributes that are length and sensitivity as discussed in subsection "Sensitivity and length evaluation". The attributes with higher sensitivity are considered to be more important and need to be masked with a tight threshold value.
Step 2:
Finding a non-sensitive border as discussed in Definition 10, concerning non-sensitive frequent itemsets. This will help in reducing the number of itemsets to be considered for evaluating the impact of any modifications that may have occurred during the sanitization process.
Step 3:
Sensitivity and the length-based threshold are determined for each sensitive itemset ($s_i \in S$) as explained in Section "Dynamic-multi-threshold calculation". We are referring to this threshold as a dynamic multi-threshold.
Step 4:
Identification of Victim item $v_i$ for each sensitive itemset $s_i$ with minimal side-effect over the non-sensitive border and maximal over sensitive information as explained in Section "Near optimal victim item identification".
Step 5:
PSO-based near-optimal set of transaction identification having a minimum fitness value. The detailed explaination is presented in Section "Transaction exploration and sanitization".
Step 6:
Deletion of victim items from explored transactions for sanitizing the dataset as explained in subsection "Our approach".

A mathematical illustration of the problem is represented as follows: For a Dataset $\mathcal {D}$, the proposed approach determines a victim item corresponding to each instance of sensitive itemset $s_i$ (where $s_i \in S$) having a significant impact on sensitive data but minimal effect on non-sensitive itemsets. Further, for each $s_i$, a dynamic multi-threshold $\delta _{dm}$ is determined based on its length and sensitivity. The PSO-based evolutionary process explores an optimum number of transactions by minimizing the fitness value in accordance with Eq. (12), to mask sensitive patterns and produce a sanitized dataset $\mathcal {D}'$. The overall working of the proposed framework is presented in Fig. 3. The first block in Figure named as data center provides the original dataset to the pre-processing block. This block separates the non-sensitive data from the sensitive portion and forwards the non-sensitive data to the third processing block. This separation helps reduce the dataset size, minimizing the number of scans required for victim identification and transaction set evaluation. Subsequently, the third block generates multi-threshold values for the sensitive itemsets and identifies the optimal victim items and corresponding transactions for modification. Finally, the sanitization block applies Particle Swarm Optimization (PSO) to remove sensitive information and produce a sanitized dataset. Each of these blocks is discussed in detail in Section Proposed sensitivity and length based multi threshold victim deletion based datasanitization scheme.

Conventional methodology used by existing EB-PPDM

In this section, we discuss the approaches followed by the latest benchmark schemes for achieving a similar goal of masking sensitive patterns using evolutionary-based schemes. ACS2DT and PSO2DT are proposed in⁵ and⁴ respectively to explore the near-optimal set of transactions concerning the sensitive patterns with the help of the fitness function presented in Eq. (12). Further, both of the schemes determine multi-threshold based on length as presented in Eq. (9). For each sensitive pattern with a smaller length is masked against a loose threshold value whereas the item with a higher length is masked against a tighter threshold. Furthermore, both of the schemes attempt to delete some of the explored transactions for masking sensitive patterns.

VIDPSO proposed in²⁷ is another approach that uses PSO for exploring sensitive transactions using a similar fitness function as used by ACS2DT and the PSO2DT. In this approach, authors initially determine the victim item against each sensitive itemset and then delete it from the explored set of transactions. VIDPSO makes a lower number of modifications in comparison to ACS2DT and PSO2DT as it deletes only victim items rather than a set of transactions. However, VIDPSO hides sensitive patterns against a static threshold value.

Most of the existing approaches focus on privacy as a sole aim of sanitizing the dataset, however, utility and data loss are also important factors that should be considered for developing a masking scheme. In this work, we propose an application-specific approach that is flexible (tuning of privacy and utility requirements) and a multi-objective approach that can handle privacy, utility, and data loss with better exploration capabilities.

Proposed sensitivity and length based multi threshold victim deletion based data sanitization scheme

This section presents a detailed discussion of the proposed sensitivity and length-based dynamic multi-threshold victim deletion-based scheme for data sanitization. The proposed scheme consists of two phases that are near-optimal victim item selection and data sanitization. Initially, we categorized sensitive patterns based on the length and sensitivity and explored the near-optimal victim item by determining their effect on border nonsensitive itemsets. Further, we attempt to improve the exploration capability of PSO by adding randomness in velocity and position for exploring near an optimal set of transactions against a sensitive itemset. Further, the population and size of the sub-particle are evaluated based on the dynamic multi-threshold with respect to the length and sensitivity of the sensitive pattern. The approach is tunable in accordance with the privacy and utility-based requirements of the application. The detailed working of the proposed scheme is discussed in the following subsections.

Sensitivity and length evaluation

For each sensitive pattern, sensitivity and length are evaluated. For example, consider a set of sensitive itemsets $S=\{abc, cd, be\}$ such that the total number of sensitive items derived from the considered sensitive items are $\theta =\{ a, b, c, d, e \}$. For each $x\in \theta$, count the number of sensitive patterns where x is a subset of an element of S. For example, $a\in \theta$ and it is a subset of $abc\in \, S$. Hence, the sensitivity value of item a is 1. Similarly, $b\in \,\theta$ and it is a subset of $abc\in \,S$ and $be\in \,S$. Hence, the sensitivity value of b is 2. The sensitivity value of the rest of the sensitive items are being determined as above and these values are shown in Table 6.

Table 6 sensitivity calculation of item belonging to sensitive itemsets.

Full size table

Once the sensitivity value with respect to each item is evaluated, the sensitivity of itemset $s_i\in S$ is determined. For example, $abc \in S$ is having items $\{a, b, c\}$. We have to add the sensitivity of each item as $\{1+2+2 =5\}$. Hence the sensitivity of abc is 5. Further, for sensitive itemset $cd \in S$, the sensitive items are c and d. Hence, sensitivity value of this cd is $\{2+1=3\}$. Further, for each sensitive itemset, length is evaluated as a count of items belonging to a considered sensitive pattern. For example, $abc\in S$,the number of items are $\{ a, b, c\}$. Hence, the length of abc is 3. The length and sensitivity value of the rest of the sensitive itemsets is shown in Table 7.

Table 7 Length and sensitivity calculation for each sensitive pattern.

Full size table

After evaluation of the length and the sensitivity values of each sensitive itemset, the positive border of the non-sensitive itemsets (NSB) is evaluated as discussed in Definition 10. Further, the threshold value is to be determined for each of these sensitive itemsets which is discussed in the next subsection.

Dynamic-multi-threshold calculation

In this work, we are considering the maintenance of both privacy and utility ratios. Therefore, we considered two parameters i.e. length and sensitivity for evaluating the dynamic threshold $\delta _{dm}$ for pattern hiding. In this article, the bi-variate normal distribution function as shown in Eq. (7) is used to determine the multi-threshold-based dynamic threshold for the varying length and sensitivity values of patterns. The reason for using this distribution function is that a bi-variate normal distribution function can describe a real-life scenario. Of course, it can be adjusted to any distribution function in accordance to the data distribution. In this work we have chosen to use normal distribution as the benchmark datasets used in this work are more relevant to it.

$$\begin{aligned} f(x,y)=\frac{1}{2\pi \sigma _X \sigma _Y \sqrt{1-\sigma ^2}} exp \left( -\frac{1}{2[1-\rho ^2]}\left[ \left( \frac{x-\mu _X}{\sigma _X}\right) ^2-2\rho \left( \frac{x-\mu _X}{\sigma _X}\right) \left( \frac{y-\mu _Y}{\sigma _Y}\right) + \left( \frac{Y-\mu _Y}{\sigma _Y}\right) ^2\right] \right) \end{aligned}$$

(7)

where x is the length and y is the sensitivity value of the pattern. $\sigma _x$ and $\sigma _y$ are the standard deviation with respect to the length and sensitivity value and $\rho$ is the correlation between length and sensitivity value. With the help of given distribution function $\delta _{dm}$ is determined as follows:

$$\begin{aligned} \begin{array}{l} F(x,y) = \frac{\delta _{min}}{f(1,1)} \times f(x,y) \quad Where, \\ \\ \delta _{dm} = \Biggl \{ \begin{array}{l} F(x,y),\quad if\; F(x,y) > \delta _{min} \\ \delta _{min}, \quad Otherwise \end{array} \end{array} \end{aligned}$$

(8)

where f(1, 1) denotes the joint probability distribution of a pattern whose length as well as sensitivity value is 1 and $\delta _{min}$ is the static minimum threshold.

We have observed some interesting scenarios of patterns exhibiting different behavior in terms of lengths and sensitivity values which are as follows:

Patterns with length and sensitivity values are not same as the average length of the pattern and average value of sensitivity, then the threshold should be evaluated by using multi-parametric bi-variate Eq. (8). This helps in evaluating the threshold concerning the impact of both parameters for masking sensitive information.
If the length of the pattern is equal to the average length, then it is considered that the sensitivity value of the patterns is a univariate normal distribution as shown in Eq. (10).
$$\begin{aligned} {\begin{matrix} F(n)= {\frac{\delta _{min}}{f(1)}} \times f(n_s)\\ \delta _{dm}=\biggl \{ \begin{array}{cc} F(n), & \quad F(n) > \delta _{min} \\ \delta _{min},& \quad Otherwise\\ \end{array} \end{matrix}} \end{aligned}$$
(9)
where $f(n_s)$ and f(1) are the probability density function with respect to the sensitivity value of a pattern as $n_s$ and 1. Here, we have assumed that the mean value of the distribution is 1.
If the sensitivity value is equal to the average value of sensitivity then it is considered that the length of the patterns is a univariate normal distributed as shown in Eq. (9).
$$\begin{aligned} {\begin{matrix} F(n)= {\frac{\delta _{min}}{f(1)}} \times f(n_l)\\ \delta _{dm}=\biggl \{ \begin{array}{cc} F(n), & \quad F(n) > \delta _{min} \\ \delta _{min},& \quad Otherwise\\ \end{array} \end{matrix}} \end{aligned}$$
(10)
where $f(n_l)$ and f(1) are the probability density function with respect to length of a pattern as $n_l$ and 1. Here, we have again assumed that the mean value of the distribution is 1 for considering the real-life scenarios.
Patterns with length and sensitivity both equal to their average value length and an average value of sensitivity respectively, then the patterns require a strict threshold value. Hence, the minimum threshold value $\delta _{min}$ should be considered for masking purposes.

Near optimal victim item identification

The victim itemsets are identified by considering their impact on the sensitive and non-sensitive patterns. For each item x belonging to a sensitive pattern $s_i$, two lists are formulated. The first list is the positive impact list $P.list_x$, which represents sensitive patterns having x. The second list is the negative impact list called $N.list_x$ holding the list of non-sensitive frequent patterns belonging to the non-sensitive positive border (NSB) which may get masked by deleting x. A data structure called $\mathcal{C}\mathcal{V}$ is created which comprises $(x, P.list_x, N.list_x)$ for each item $x \in s_i$. For example, consider a set of frequent patterns $F=\{bce, ab, bc,ce, be\}$ where bce and ab are the sensitive patterns. $\mathcal{C}\mathcal{V}$ with respect to bce is shown in Table 8.

Table 8 CV holding item and corresponding lists.

Full size table

For selecting the victim item, the items in $\mathcal{C}\mathcal{V}$ are to be first sorted in increasing order of the number of non-sensitive patterns in N.list. Further, we are considering the following criteria for selecting the near-optimal victim items:

Criterion 1: Select x if $N.list_x= \phi$ otherwise, check Criterion 2.
Criterion 2: Sort each N.list corresponding to an item in increasing order of the support of non-sensitive items present in the list. If the minimal element of the $N.list_x$ is in a safe state (as explained in Definition 11) then select x as a victim otherwise, check Criterion 3 for the unsafe states.
Criterion 3: Sort all items $x \in s_i$ with respect to the length of their respective P.list i.e. the number of sensitive patterns that would get masked by deleting x. Select item x if the corresponding P.list length is maximum.

The proposed scheme of selection of victim items as we have discussed above is presented in Algorithm 1. This methodology of the selection of victim items ensures the high utility of the dataset and it also minimizes the side-effects of hiding the data over non-sensitive patterns.

Transaction exploration and sanitization

In this section, we discuss Particle Swarm Optimization and our approach in detail.

Particle swarm optimization:

The particle swarm optimization is based on the exploration of search space by the entities that are participating in this process. For example, Fig. 4 shows this behavior with the help of a flock of birds that are searching for food in some search space. Here, each bird explores this search space and collects the information with the help of two variables. The first variable is the best solution achieved by a bird and the second variable is the global solution assessed by their whole flock. In this way, the bird is continuously exploring the search space by modifying its position, and, based on the solutions noted by it, it also changes its velocity. With the help of this idea, they achieve the final aim. We are taking the same idea in our work such that each bird is being represented by a particle (or candidate solution) and the findings by them as personal best $p_{best}$ and global best $g_{best}$ solutions. This process iteratively evolves for a specified number of potential solutions. After each iteration, the velocity and position of each particle are updated to achieve the desired solution.

By taking the help of notations used in the previous sub-section, we are explaining the working of particle swarm optimization for our aim in the subsequent text.

Our approach

For each victim item $v_i\in V.list$, a near-optimal set of transitions $\mathcal {T}= \{T_1, T_2,...T_z\}$ (where z represents the number of explored transactions) is identified using the evolutionary PSO based scheme. PSO starts by initiating M number of particles $P=\{P_1, P_2,\ldots , P_M\}$. Each particle in the population represents the candidate solution of the search. Each particle $P_i$ is subdivided into N number of sub-particles i.e. $\{sp_{i1}, sp_{i2}, sp_{i3},\ldots , sp_{iN}\}$ where N represents the total number of victims or sensitive items that must be masked. The sub-particle contains the identifiers of those transactions that could be modified. The significance of the size of a sub-particle is defined as follows.

Definition 12

(Size of subparticle): represents the total number of transactions that must be modified to reduce the support count of corresponding sensitive itemset $s_i$ ($sup(s_i)$) below the determined threshold value (i.e. $\delta _{dm}\,\times \mid D\mid$). This is being evaluated using Eq. (11).

$$\begin{aligned} m_j = \{\mathcal{P}\mathcal{U} \times (argmax\{sup(s_i)\in \mathcal {S}\})-\delta _{dm} \times \mid \mathcal {D}\mid + 1\} \end{aligned}$$

(11)

where $m_j$ represents the size of subparticle $sp_{ij}$, $\mathcal{P}\mathcal{U}$ is an adjusting parameter that is being used to adjust privacy and utility requirements such that $\mathcal{P}\mathcal{U} \in$[0,1]. If $\mathcal{P}\mathcal{U} \rightarrow 0$, then it represents high utility but no privacy, otherwise, if $\mathcal{P}\mathcal{U} \rightarrow 1$, it represents maximum privacy but reduced utility. This helps in inducing flexibility in the sanitization of a dataset in real-life scenarios. In this work, the size of each sub-particle varies as each sensitive item has a different threshold value i.e. $\delta _{dm}$. In experiments, we have taken the minimum size of a sub-particle as 1 (detailed discussion in Annexure A ) because a high value of $\delta _{dm}$, as determined in Eqs. (8), (9), and (10), sometimes generates a negative value of $m_j$ in Eq. (11).

The fitness value of each sub-particle is calculated to explore the best set of transactions for deleting a victim item $v_i$ for hiding a sensitive pattern.

Definition 13

(Fitness Value): The fitness value is evaluated with the help of side-effects defined in Definition 7, Definition 8, and Definition 9.

$$\begin{aligned} f(P_i)= w_1* \alpha +w_2*\beta + w_3*\gamma \end{aligned}$$

(12)

where, $w_1$, $w_2$ and $w_3$ are the user-defined weights that could be adjusted by the respective side-effects such that $w_1+w_2+w_3 =1$, $f(P_i)$ represents the fitness value of particle $P_i$, and $\alpha , \; \beta , \; \text{ and } \; \gamma$ is the earlier discussed side-effects. This value shows the optimality of the explored set of transactions. The smaller fitness value of the transaction set shows low side effects if chosen for modifications. In our experiments we kept $w_i =0.8$ and $w_j\, = \,w_k\, = \, 0.1$. Finally, the explored transaction set with respect to minimum fitness value showing minimal side effects and maintaining high utility, is selected to delete the victim item and hide confidential patterns.

To improve the exploration capability of the approach, it is assured that no transactions occurring in an explored transaction set ET concerning victim $v_i$, are included in any other sub-particle existing in the population at the instance. This improves the exploring potential of the approach. The set of transactions selected for deleting the victim item after each iteration is then updated using Eqs. (13) and (14) representing velocity and position respectively.

$$\begin{aligned} v_{ij}(t + 1)= & n_1 * \,rand(P_{best_{ij}} - sp_{ij})\cup \,n_2 * \, rand(G_{best_j} - sp_{ij} )\cup \, n_3 *\, rand(ET) \end{aligned}$$

(13)

$$\begin{aligned} sp_{ij}(t + 1)= & v_{ij}(t + 1)\cup n_4 * rand(sp_{ij}(t)) \end{aligned}$$

(14)

The velocity and position of sub-particles represent the set of transactions selected for deleting the identified victim item. In Eqs. (13) and (14), $n_1$ denotes local search, $n_2$ is global search, $n_3$ is exploration capability and $n_4$ represents inertia. These are positive integers that denote the number of random transactions selected from their respective sets.

$$\begin{aligned} n_1= & \Bigg \{ \begin{array}{ll} \lfloor 0.3 \times m_j\rfloor ,\quad if \lfloor 0.3 \times m_j \rfloor <= \mid p_{best_{ij}}-p_{ij}\mid \\ \mid p_{best_{ij}}-p_{ij}\mid , \quad else\\ \end{array} \end{aligned}$$

(15)

$$\begin{aligned} n_2= & \Bigg \{ \begin{array}{ll} \lfloor 0.3 \times m_j\rfloor ,\quad if \lfloor 0.3 \times m_j \rfloor <= \mid g_{best_{j}}-p_{ij}\mid \\ \mid g_{best_{j}}-p_{ij}\mid , \quad else\\ \end{array} \end{aligned}$$

(16)

$$\begin{aligned} n_3= & \Bigg \{ \begin{array}{ll} \lfloor 0.3 \times m_j\rfloor ,\quad if \lfloor 0.3 \times m_j \rfloor <= \mid ET\mid \\ \mid ET\mid , \quad else\\ \end{array} \end{aligned}$$

(17)

$$\begin{aligned} n_4= & m_j-n_1-n_2-n_3 \end{aligned}$$

(18)

In experiments, we kept 30-30% of the size of the sub-particle intended to be filled from $P_{best_{ij}} - sp_{ij}$, $G_{best_j} -sp_{ij}$ and ET if their sizes are large enough. The remaining part of the subparticle $sp_{ij}(t + 1)$ is filled from $sp_{ij}(t)$. This percentage value can be adjusted by the user accordingly to explore new solutions. The values of $p_{best}$ and $g_{best}$ are updated by comparing the fitness value of newly formed particles to the current particle. The transactions present in sub-particles of $g_{best}$ are being used for removing the victim item $v_i$ and hence the support of $s_i$ is reduced by modifying those transactions. The algorithmic loop checks for termination conditions for the fulfillment of the final goal of the masking procedure.

Definition 14

(Termination Condition): The loop needs to be repeated until all of the sensitive itemsets get masked or all the victim items get deleted i.e.

$$\begin{aligned} check: \quad If (sup(s_i) <(\delta _{dm}(s_i)\,\times \mid D\mid )\; Then\; S&= \{S-s_i\}\; \\ Endloop:\quad If (\mathcal {S}&= \phi ) . \end{aligned}$$

The termination condition is satisfied if the set of sensitive itemsets becomes empty i.e. $\mathcal {S} = \phi$. This is achieved by decreasing the support count sup(s) of each sensitive itemset $s_i$ below $\delta _{dm}\,\times \,\mid D\mid$. Once the support value goes below the threshold, the corresponding sensitive item is removed from the sensitive set S.

The final set of modified transactions with no instances of sensitive patterns would be added to the sanitized dataset

Experimental results

All the experiments are conducted to show the significance of the proposed scheme over previous benchmark schemes named PSO2DT, ACS2DT, and VIDPSO. These three approaches have been selected for comparison as they are more recent and have already shown superiority over previously existing results. All of these methods have been implemented using MATLAB and performed over the Windows 10 operating system. The approaches have been compared over transactional and medical datasets as discussed in Table 9. The Mushroom and Chess datasets are dense transactional datasets and HD and HAP are healthcare datasets.

Table 9 Datasets and their characteristics.

Full size table

We have evaluated the performance in terms of fitness value, privacy and utility achieved, and data loss after the sanitization process. Each of these deliverables is discussed in detail as follows.

Fitness values

The fitness function determines the closeness of the solution to the desired objective. We compare the different benchmark schemes with the proposed approach regarding the fitness function. In the experiments, $w_i$ is kept higher for the sensitive patterns with high privacy requirements, and $w_j$ is kept high for maintaining the utility of the data. The trade-off between both of them is maintained to get the unified result of privacy and utility. The value of $w_k$ is kept low as no artificial pattern gets generated using the proposed scheme.

Initially, Fig. 5 represents a comparative view of the proposed scheme and conventional schemes corresponding to the fitness value achieved by these schemes. We have conducted different sets of experiments with four datasets chess, mushroom, Heart Dataset (HD), and Heart Attack Prediction (HAP) dataset, at varying minimum threshold values and percentages of sensitivity values. We observe that in sub-figures a, b, c, and d of Fig. 5, the fitness value gained by the proposed scheme is significantly lower than other comparative schemes. This is due to the more exploring capability of the proposed approach that is achieved using updated velocity and position equations. This enhanced exploration capability helps create solutions causing minimal side-effects and thus lowering the fitness value. ACS2DT and PSO2DT schemes lack these results due to the lower capability of exploration. However, VIDPSO also achieves comparatively better fitness value due to its capability of exploration of transactions.

Figure 6 shows the outcome of another experiment conducted with varying percentages of sensitive itemsets concerning minimum support threshold value over benchmark datasets. It represents a comparative view of fitness value achieved with varying percentages of sensitive itemsets. The graphs clearly show that the updated equations of velocity and position in the proposed algorithm help in achieving significantly lower fitness values and hence transactions are being modified in a better way.

Privacy and utility

Further, we have also compared the proposed scheme with benchmark multi-threshold based (ACS2DT) and static threshold (PSO2DT and VIDPSO) based approaches in terms of two side-effects i.e. FTH and NTH with varying percentages of sensitive patterns and minimum threshold value ($\delta _{min}$).

1.
Figure 7 shows the performance of the proposed scheme in terms of FTH value with varying minimum support thresholds. It represents the number of sensitive patterns that are not hidden after completing the privacy preservation algorithm. Hence, we are able to deduce that the higher values of FTH indicate there is a high breach of privacy. The higher the FTH value more the privacy breach as sensitive information is present in the sanitized dataset. It also shows that the proposed scheme performs significantly better than PSO2DT and ACS2DT, however, the FTH value is higher in the proposed scheme in comparison to VIDPSO. This is because the proposed scheme follows the dynamic threshold phenomenon. It implies that the few sensitive patterns have a loose threshold value which in turn chooses not to hide a pattern completely. This leads to a higher value of FTH. Since VIDPSO has a fixed threshold value, it masks all sensitive patterns before termination. This achieves a greater percentage of privacy but the utility of the dataset is reduced because all sensitive itemsets are treated equally. Due to the dynamic threshold phenomenon, this would not happen in the proposed scheme. Therefore, the proposed scheme possesses a high FTH value to support the better utility of the dataset.

Additionally, we have also determined results with respect to varying percentages of sensitive itemsets. The plots presented in Fig. 8 show the FTH value achieved by different benchmark schemes over respective datasets with varying percentages of sensitive item sets. The result clearly shows that the proposed scheme generates better results in comparison to ACS2DT and PSO2DT.
2.
In Figs. 9 and 10, we are presenting another set of results concerning NTH value. Here we observe that the proposed scheme generates a lower value of NTH in comparison to the basic ACS2DT, PSO2DT, and VIDPSO. The reasons are presented as follows.
- We are not removing the entire records but only the identified victim items. This helped in preserving the non-sensitive data that is present in considered transactions.
- The proposed scheme uses the concept of dynamic multi-threshold, unlike VIDPSO which has a fixed threshold. VIDPSO continues to delete victims until all the sensitive itemsets are masked, however, the proposed scheme decides the threshold value according to the length and sensitivity value of the pattern.
- Further, the flexibility of treating each sensitive item according to its importance in terms of utility, allows for changing the privacy-based parameter and hence deleting lesser information to mask sensitive information.

Data similarity

Besides FTH and NTH, loss of data in terms of information deleted should be determined to evaluate the performance of an algorithm. We evaluated the Data Loss of proposed schemes by presenting the percentage of modified transactions and the victim items deleted for hiding sensitive patterns in Table 10. Column 1, represents the considered dataset, column 2 shows the considered static threshold value and column 3 presents the percentage of sensitive patterns. Columns 4 illustrate the percentage of sensitive transactions modified to hide sensitive knowledge. Further Table 11 shows the loss of data with varying minimum threshold values with respect to conventional schemes. It can be observed that the PSO2TD and ACS2DT lose a higher percentage of data as transactions are deleted in order to mask sensitive patterns. However, VIDPSO shows lower data loss due to victim item deletion for hiding sensitive knowledge. The proposed scheme shows promising results and shows comparatively better results because of multi-threshold values considered for sensitive itemsets.

Table 10 Analysis of data loss in terms of percentage of sensitive transactions modified with varying minimum threshold and % of sensitive patterns.

Full size table

Table 11 Data loss analysis with varying minimum threshold and percentage of sensitive patterns with respect to the proposed and the benchmark algorithms.

Full size table

Further, we evaluated the overall effect of the hiding procedure on the original dataset in terms of data loss. The similarity is maintained by a sanitized dataset concerning the original one as shown in Eq. (19).

$$\begin{aligned} \mathcal{D}\mathcal{S}= \frac{\mid D'\mid }{\mid D\mid } \end{aligned}$$

(19)

If $\mathcal{D}\mathcal{S}$ is higher, it shows that there are a lesser number of modifications.

Here, $\mid D' \mid$ represents the size of the sanitized dataset and $\mid D \mid$ is the size of the original dataset. Two sets of experiments have been conducted i.e. with varying minimum support threholds as well as for varying percentage of sensitive itemsets. Figure 11 shows the plot between data similarity achieved with varying minimum support thresholds. It can be observed that the proposed scheme produces a sanitized dataset with high similarity to the original dataset. This is because the proposed scheme incorporates a lower number of modifications in comparison to ACS2DT and PSO2DT whereas the proposed algorithm is deleting victims rather than deletion of whole sensitive transactions. Further, in comparison to VIDPSO, our scheme possesses a lower number of deletions due to multi-threshold constraints which inturn offers less modification in the data. Hence, the proposed scheme performs significantly better and generates a sanitized dataset with high similarity, privacy, and utility.

Figure 12 shows the similarity achieved with varying percentages of sensitive itemsets. It should be noted that the performance of the proposed scheme is again better than the other state-of-the-art algorithms.

Conclusion

Privacy and utility are the two orthogonal properties that should be balanced while hiding sensitive information during data transformation. The scheme proposes dynamic multi-threshold constraints to categorize sensitive patterns in terms of sensitivity and length, and deletion of high-impact victim items which mitigates the effect of modifications and reduces the side-effect. It is to be noted that the algorithms ACS2DT and PSO2DT delete the whole transaction, but our proposed algorithm removes the victim item from the transactions. In VIDPSO, the authors are also removing victim items but they are using a fixed threshold value. In our approach, we are deleting the victim items with a dynamic multi-threshold value. This improves the utility, reduces the data loss, and increases the similarity of sanitized data with the original dataset. Furthermore, the high exploration capability of Particle Swarm Optimization selects a near-optimal set of transactions for introducing modification and masking sensitive information. A reproducible set of experiments has been conducted which shows that the proposed scheme has significantly better performance against state-of-the-art algorithms.

Dynamic multi-thresholding opens new doors to creating more context-aware and adaptive sanitization strategies. Future scope may include learning threshold values dynamically using machine learning. In the future, extensions can apply this victim-item deletion logic in federated learning models, where sensitive data never leaves the user’s device. Furthermore, developing robust impact scoring mechanisms that assess the influence of item deletion on downstream tasks such as classification or clustering are the areas that need.

Data availability

The data that supports the findings of this study are available from the first author on request.

References

Sharma, U., Toshniwal, D. & Sharma, S. A sanitization approach for big data with improved data utility. Applied Intelligence 50, 2025–2039 (2020).
Article Google Scholar
Lin, C. W., Hong, T. P., Yang, K. T. & Wang, S. L. The ga-based algorithms for optimizing hiding sensitive itemsets through transaction deletion. Applied Intelligence 42, 210–230 (2015).
Article Google Scholar
Lin, C.W. et al. Efficiently hiding sensitive itemsets with transaction deletion based on genetic algorithms. The Scientific World Journal 2014, 398269 (2014).
Wu, J. M. T., Srivastava, G., Lin, J. C. W. & Teng, Q. A multi-threshold ant colony system-based sanitization model in shared medical environments. ACM Trans. Internet. Technol. 21, 1–26 (2021).
Article Google Scholar
Wu, J. M. T., Zhan, J. & Lin, J. C. W. Ant colony system sanitization approach to hiding sensitive itemsets. IEEE Access 5, 10024–10039 (2017).
Article Google Scholar
Wu, J.M.T., Lin, C.W., Fournier-Viger, P., Djenouri, Y., Chen, C.H., & Li, Z. The density-based clustering method for privacy-preserving data mining. Math. Biosci. Eng.16(3), 1718–1728 (2019a).
Lin, J. C. W. et al. A sanitization approach for hiding sensitive itemsets based on particle swarm optimization. Eng. Appl. Artif. Intell. 53, 1–18 (2016).
Article Google Scholar
Gui, Y., Gan, W., Wu, Y., & Philip, S.Y. Privacy preserving rare itemset mining. Info. Sci. 662, 120262 (2024).
Ashraf, M., Rady, S., Abdelkader, T. & Gharib, T. F. Efficient privacy preserving algorithms for hiding sensitive high utility itemsets. Computers & Security 132, 103360 (2023).
Article Google Scholar
Telikani, A., Shahbahrami, A., Shen, J., Gaydadjiev, G. & Lin, J. C. W. An edge-aided parallel evolutionary privacy-preserving algorithm for internet of things. Internet of Things 23, 100831 (2023).
Article Google Scholar
M. Rajasekaran, M. T. & Meenakshi, A. Association rule hiding using enhanced elephant herding optimization algorithm. Automatika 65, 98–107. https://doi.org/10.1080/00051144.2023.2277998 (2024).
Article Google Scholar
Wu, W., Wu, X. & Wan, Y. Single-image shadow removal using detail extraction and illumination estimation. Vis. Comput. 38, 1677–1687 (2022).
Article Google Scholar
Zhang, S. et al. Semantic-aware dehazing network with adaptive feature fusion. IEEE Trans. Cybern. 53, 454–467 (2021).
Article Google Scholar
Duan, J., Xiong, J., Li, Y., & Ding, W. Deep learning based multimodal biomedical data fusion: An overview and comparative review. Inf. Fusion.. 112, 102536 (2024).
Wang, J., Li, X. & Ma, Z. Multi-scale three-path network (mstp-net): A new architecture for retinal vessel segmentation. Measurement 250, 117100 (2025).
Article Google Scholar
Holland, J.H. Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence. MIT press (1992).
Kennedy, J., & Eberhart, R. Particle swarm optimization, In: Proceedings of ICNN’95-international conference on neural networks, IEEE. 1942–1948 (1995).
Dorigo, M. & Gambardella, L. M. Ant colony system: a cooperative learning approach to the traveling salesman problem. IEEE Trans. Evol. Comput. 1, 53–66 (1997).
Article Google Scholar
Han, S., & Ng, W.K. Privacy-preserving genetic algorithms for rule discovery. In: International conference on data warehousing and knowledge discovery, Springer, 407–417 (2007).
Cheng, P., & Pan, J.S. Association rule hiding based on evolutionary multi-objective optimization by removing items. In: Proceedings of the AAAI Conference on Artificial Intelligence (2014).
Bonam, J., Ramamohan Reddy, A., & Kalyani, G. Privacy preserving in association rule mining by data distortion using pso. In: ICT and Critical Infrastructure: Proceedings of the 48th Annual Convention of Computer Society of India-Vol II: Hosted by CSI Vishakapatnam Chapter, Springer. 551–558 (2014).
Kalyani, G., Chandra Sekhara Rao, M., & Janakiramaiah, B. Particle swarm intelligence and impact factor-based privacy preserving association rule mining for balancing data utility and knowledge privacy. Arab. J. Sci. Eng. 43, 4161–4178 (2018).
Afshari, M. H., Dehkordi, M. N. & Akbari, M. Association rule hiding using cuckoo optimization algorithm. Expert. Syst. Appl. 64, 340–351 (2016).
Article Google Scholar
Janakiramaiah, B., Kalyani, G., Chittineni, S., & Narendra Kumar Rao, B. An unbiased privacy sustaining approach based on sgo for distortion of data sets to shield the sensitive patterns in trading alliances, in: Smart Intelligent Computing and Applications: Proceedings of the Second International Conference on SCI 2018, Volume 2, Springer. pp. 165–177 (2019).
Cheng, P., Roddick, J. F., Chu, S. C. & Lin, C. W. Privacy preservation through a greedy, distortion-based rule-hiding method. Applied Intelligence 44, 295–306 (2016).
Article CAS Google Scholar
Lin, J. C. W., Zhang, Y., Zhang, B., Fournier-Viger, P. & Djenouri, Y. Hiding sensitive itemsets with multiple objective optimization. Soft Computing 23, 12779–12797 (2019).
Article Google Scholar
Jangra, S. & Toshniwal, D. Vidpso: Victim item deletion based pso inspired sensitive pattern hiding algorithm for dense datasets. Inf. Process. Manag. 57, 102255 (2020).
Article Google Scholar
Wu, T. Y., Lin, J. C. W., Zhang, Y. & Chen, C. H. A grid-based swarm intelligence algorithm for privacy-preserving data mining. Applied Sciences 9, 774 (2019).
Article CAS Google Scholar

Download references

Acknowledgements

This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (No. 2021R1F1A1055408). This work was supported by Institute of Information & communications Technology Planning & Evaluation(IITP) grant funded by the Korea government(MSIT) (No.RS-2025-02216517, Development of Reconfigurable AI Processor Software Framework Technologies).

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Thapar Institute of Engineering and Technology, Patiala, Punjab, India
Shivani Sharma & Rohan Sharma
Akian College of Science and Engineering, American University of Armenia, Yerevan, 0019, Armenia
Sachin Kumar
School of Computing, Gachon University, Seongnam, Republic of Korea
Hong Min

Authors

Shivani Sharma
View author publications
Search author on:PubMed Google Scholar
Rohan Sharma
View author publications
Search author on:PubMed Google Scholar
Sachin Kumar
View author publications
Search author on:PubMed Google Scholar
Hong Min
View author publications
Search author on:PubMed Google Scholar

Contributions

Conceptualization: S.S., R.S., S.K.; Methodology: S.S., R.S.; Software: S.S., R.S.; Validation: S.S., S.K., H.M.; Visualization: S.S., S.K., H.M.; Writing-original draft: S.S., R.S.; Writing-reviewing and editing: S.S., S.K., H.M.; Supervision: S.S., S.K., H.M.; Funding acquisition: H.M. All authors reviewed the manuscript.

Corresponding author

Correspondence to Hong Min.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Annexure A

Here, we discuss the role of the default size of a sub-particle over the deliverables considered in this work. In the presented work, we have taken the default sub-particle size as 1. However, it may vary according to the considered datasets and the application because, in our experiments, we have noted that there are high instances of negatively sized sub-particles for HD and HAP datasets, whereas, in Chess and Mushroom datasets, there are few instances of negatively sized sub-particles. Since the number of transactions is lesser in HD and HAP datasets in comparison to Chess and Mushroom datasets, this may be the reason for obtaining negatively sized sub-particles while applying the dynamically multi-threshold-based phenomenon.We are able to show this reasoning by varying the size of sub-particles over the proposed scheme using benchmark datasets as shown in Fig. 13. It should be noted that fitness value as well as NTH value is not much affected by the size of the sub-particle. However, the FTH value varies with different sizes of sub-particles. In the Fig. 14, we have taken the following sizes of the sub-particles

$$\Bigg \{1,\frac{m_j}{4},\frac{m_j}{3},\frac{m_j}{2} \Bigg \},$$

where $m_j$ refers to the size of sub-particles determined by Eq. (11), such that for each of the considered datasets, we noted that the value of FTH is decreasing with increasing sizes of sub-particles considered in the above mathematical expression. Rather than taking a fixed size of sub-particles, we have also taken a range of the size of sub-particles where a random value is being selected from that range. We have taken the following range of sizes of these sub-particles

$$\Bigg \{[1,\frac{m_j}{4}], [1,\frac{m_j}{3}], [\frac{m_j}{4},\frac{m_j}{2}]\; \text {and} \;[\frac{m_j}{3},\frac{m_j}{2}]\Bigg \}$$

. The results of FTH values for these ranges are shown in the Table 12. We have again noted that the fitness value as well as NTH value has a similar type of behavior as the earlier taken sizes of sub-particles, hence we are only showing the results about FTH value. It should be noted from the Table 12 that the FTH value is lesser for such ranges of sub-particles in comparison to those cases where the size of the sub-particle is only 1.

Table 12 FTH Values obtained in proposed algorithm with sub particle size varying in a range, for example, $m_j/ 4$ to $m_j/2$ shows that the sub particle sizes lie in between $m_j/4\; \text {to}\; m_j/2$ where $m_j$ is being determined by Eq. (11).

Full size table

Annexure B

The performance of the proposed algorithm in terms of execution time is evaluated, conducting various experiments with varying minimum support threshold and varying percentage of sensitive itemsets, as shown in Fig. 15 It can be observed from these figures that PSO2DT and ACS2DT takes the highest execution time as compared to the rest of the algorithms due to more number of transactions considered for deletion. The algorithms which are based on transaction deletion like PSO2DT and ACS2DT have to calculate artificial itemsets which would be generated. While in the case of proposed scheme artificial itemsets calculation is not required, which improves execution time. As the minimum support threshold increases, the number of frequent itemsets generated decreases and so are the sensitive itemsets. Hence, the execution time decreases with the decrement in minimum supoort threshold. The execution time depends on the number of subparticles and their size in proposed algorithm. Unlike VIDPSO proposed scheme the subparticle size vary with respect to sensitivity and length of victim item hence, the number of modification is less when size of subparticle is small and inturn execution time is less whereas in VIDPSO the sub particle size is strictly same hence execution time is bit more. From the plots shown in Fig. 15 it can be concluded that the execution taken by the proposed algorithm is either less or comparable to other evolutionary algorithms Further, to assess scalability, we implemented the proposed scheme across a range of datasets of varying sizes, as illustrated in Fig. 16 of Annexure B. The evaluation demonstrated that the approach is scalable, consistently completing its process within real-time constraints. (Note: in this experiment we have taken chess dataset and randomly added transactions in original dataset for increasing the size of dataset)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Sharma, S., Sharma, R., Kumar, S. et al. An evolutionary computation-based sensitive pattern hiding model under a multi-threshold constraint in healthcare. Sci Rep 15, 35062 (2025). https://doi.org/10.1038/s41598-025-03346-4

Download citation

Received: 30 September 2024
Accepted: 20 May 2025
Published: 08 October 2025
DOI: https://doi.org/10.1038/s41598-025-03346-4

Subjects

Abstract

Similar content being viewed by others

Application of information theoretic feature selection and machine learning methods for the development of genetic risk prediction models

Exploring the cytotoxic effects of bioactive compounds from Alcea rosea against stem cell driven colon carcinogenesis

Cooperative metaheuristic algorithm for global optimization and engineering problems inspired by heterosis theory

Introduction

Definition 1

Motivation

Contribution

Definition 2

Definition 3

Related work

Preliminaries and problem statement

Preliminaries

Definition 4

Definition 5

Definition 6

Definition 7

Definition 8

Definition 9

Definition 10

Definition 11

Problem statement

Conventional methodology used by existing EB-PPDM

Proposed sensitivity and length based multi threshold victim deletion based data sanitization scheme

Sensitivity and length evaluation

Dynamic-multi-threshold calculation

Near optimal victim item identification

Transaction exploration and sanitization

Particle swarm optimization:

Our approach

Definition 12

Definition 13

Definition 14

Experimental results

Fitness values

Privacy and utility

Data similarity

Conclusion

Data availability

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing interests

Additional information

Publisher’s note

Appendices

Annexure A

Annexure B

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Quick links