An efficient and interactive feature selection approach based on copula entropy for high-dimensional genetic data

Yan, Xiaoran; Shang, Shilong; Li, Dongxi; Dang, Yun

doi:10.1038/s41598-025-15068-8

Published: 17 August 2025


Scientific Reports volume 15, Article number: 30100 (2025) Cite this article


Abstract

Feature selection (FS) is especially important for high-dimensional data. In this paper, we propose an efficient and interactive feature selection approach based on copula entropy (CEFS+). The method combines feature-feature mutual information with feature-label mutual information and uses a maximum relevance minimum redundancy strategy for greedy selection. The approach uses copula entropy as a measure of feature relevance that captures the full-order interaction gain between features. We prove the divisibility of multivariate mutual information, derive a novel feature selection criterion from it, and propose a feature selection approach based on copula entropy called CEFS. To overcome the instability of CEFS on some datasets, we further propose the improved method CEFS+, which is based on a rank technique. Finally, we evaluate the effectiveness of CEFS and CEFS+ using three classifiers on five datasets. In 10 out of 15 scenarios, our approach obtains the highest classification accuracy, which is much higher than that of the other six commonly used FS methods. In particular, our approach performs better on high-dimensional genetic datasets.

Introduction

Background

Data is an important basis for understanding the world, and processing and analyzing data is an important means of improving productivity. In the digital age, data has become a valuable asset. In recent years, with the improved performance of computers and other electronic devices, vast amounts of data are generated every minute, and a large portion of them is high-dimensional. High-dimensional data are common in many fields, such as disease analysis and diagnosis, social platforms, financial analysis, weather forecasting, online education, and bioinformatics. Processing high-dimensional data is a challenge. On the one hand, high-dimensional data significantly increase the computational and storage complexity of classification models; on the other hand, high-dimensional data contain many irrelevant or redundant features, and these “low-value” features not only greatly reduce the accuracy of the classifier but also make the underlying problem harder to analyze.

Dimensionality reduction is an approach to solving the above problems. It can be categorized into two kinds: feature selection (FS) and feature extraction (FE). FS directly identifies feature subsets from the original set, while FE projects the original high-dimensional features into a new low-dimensional feature space, where the newly constructed features are usually linear or nonlinear combinations of the original features¹. Theory and practice have shown that both FS and FE improve learning performance, increase computational efficiency, reduce memory storage, and yield models that generalize better. However, because the features selected by FS are original features, FS has better interpretability than FE.

Feature selection (FS), also known as variable, attribute, or variable subset selection, is commonly used in machine learning and statistics to select a subset of features and thus construct a model that describes the data. There are two important aspects of feature selection: maximum relevance and minimum redundancy. Maximum relevance means that the subset of features contains the largest amount of useful information about the label, and minimum redundancy means that the subset of features contains the least amount of duplicate information about the label. In other words, FS selects as few features as possible while obtaining as much information about the label as possible. After FS, we can build smaller models with the selected features, which can greatly reduce complexity and improve interpretability as well as accuracy. Therefore, FS has a wide range of applications in many fields and is extremely important.

Types of FS

FS approaches can be broadly categorized into four groups based on the selection method: the filter approach, the wrapper approach, the embedded approach and the hybrid approach.

Filter approach²: This approach does not use machine learning to select a subset of features. It is commonly used on datasets with a large number of features. Filter-based feature selection methods are faster than wrapper-based methods.

Wrapper approach³: This approach uses the accuracy of machine learning to assess the importance of the features used. Wrapper methods can provide high classification accuracy for specific classifiers, but usually they have high computational complexity.

Embedded approach²: This approach performs feature selection during training and is specific to the applied machine learning algorithm.

Hybrid approach⁴: This approach is a combination of filter and wrapper approach. The filter approach selects the candidate feature set from the original feature set and refines the candidate feature set by the wrapper approach. It combines the advantages of these two methods.

Based on the selection criteria, FS approaches can also be classified into the following four categories⁵: statistical measure based FS, information theory based FS, similarity measure based FS and sparse learning measure based FS.

Applications of FS

FS can effectively deal with high-dimensional data and improve the learning efficiency, so it is widely used in many fields such as data mining and machine learning, network anomaly detection and natural language processing.

Machine learning is appropriate when the task is defined by a series of cases or examples rather than in terms of an algorithm or rules. Machine learning has a wide range of applications in many fields, including robotics, pattern recognition, and bioinformatics. Classifiers can acquire useful information by learning features from data, and based on this acquired information, further predict labels for new data. In practical applications, the features are first processed by FS and then the classifier can use a subset of the selected features to learn useful information about the labels.

Real-time anomaly detection is extremely challenging for network security researchers and practitioners. Network data contains a large number of features, so anomaly detection systems need to spend a lot of time processing features. To solve this problem, we need to conduct FS to identify as few valuable features as possible from network data.

Recently, content-based document management tasks have become very important. Text categorization assigns a boolean value to a document which reflects the category of this document. In document categorization, the occurrence of any word in a document can be considered as a feature. The feature set used reflects certain properties of the words and their context in the actual text. A good feature set can quickly predict the category of a document. In automatic text categorization, the feature space contains unique terms (words or phrases) of the document, which may number in tens, hundreds or thousands. As a result, text categorization is not only time-consuming but also difficult in many cases. It is therefore highly desirable to reduce the feature space in order to categorize documents efficiently.

Gene analysis is a very important topic in bioinformatics. Microarrays can monitor the expression of thousands of genes in parallel, which can generate a large amount of valuable data. Due to the high-dimensionality and overfitting problems of gene expression data, it is difficult to obtain satisfactory classification results by machine learning techniques. In addition, we would like to find effective genes that cause specific diseases from a large number of genes. This makes FS particularly important when dealing with genetic data⁶.

Motivation and contribution

FS is an important research area in knowledge discovery and data mining. Due to the rapid development of technology, people are acquiring more and more data, and datasets are becoming larger and larger. FS is essential for high-dimensional data. In many application areas, such as cyber security and bioinformatics, FS plays an important role in reducing complexity and improving prediction accuracy. Although many FS methods have been proposed in the past, there are still some shortcomings. Especially in the field of bioinformatics, where genes interact with each other and jointly express physiological and pathological changes, it is necessary to consider correlation, redundancy, and interaction gain simultaneously to obtain the optimal subset of features⁷ and thus achieve the best prediction accuracy. Most current feature selection methods do not do this. To address this, we propose a novel copula entropy based FS method.

The main contribution of this paper is an interactive feature selection method based on copula entropy (CEFS+), which is well suited to high-dimensional data with feature interaction, such as genetic data. It is well known that certain diseases are jointly determined by two or more genes. The value of these genes as a whole is more than the sum of the value of each gene individually; we call the difference the interaction gain. The new FS method proposed in this paper can effectively capture these interaction gains and thus select a better-quality subset of features. We validate its effectiveness using three classifiers on five datasets. In 10 out of 15 scenarios, our method obtained the highest classification accuracy, which is much higher than that of the other six commonly used FS methods. Our method performs even better on the 3 high-dimensional genetic datasets, on all of which it has the highest average accuracy. Our contributions can be summarized as follows:

We prove the divisibility of multiple mutual information: The information in the variable set pointing to a variable is equal to all the information minus the information in the variable set.
We organically combine the divisibility of multiple mutual information with copula entropy theory, obtain a novel measure of feature selection, and verify its feasibility on small data.
In response to the pre-existing disadvantages of the basic version of the proposed method, we improve it by adopting the rank strategy and propose the CEFS+ method, which achieves good results.

The rest of the paper is organized as follows. In Section 2, we briefly explain the related work and motivation. Section 3 introduces the theoretical foundations of information theory. The proposed method is discussed in Section 4. Experimental results are shown in Section 5. Finally, conclusions are discussed in Section 6.

Related work

Currently, scholars have proposed many FS approaches. The mainstream idea of these approaches is to search for the best subset of features that provide good prediction results⁸.

Most FS approaches are based on certain criteria (e.g., statistical measures and information theory), and some are population-based heuristics (e.g., particle swarm optimization, ant colony optimization, simulated annealing, and genetic algorithms). Bhatt and Gopal in 2005 used fuzzy rough set theory for FS based on fuzzy t-norms and the natural properties of t-conorms⁹. Vignolo, Milone and Scharcanski in 2013 introduced a novel FS method based on a multi-objective evolutionary wrapper using a genetic algorithm¹⁰.

The similarity measure based Relief algorithm¹¹, proposed by Kira and Rendell in 1992, selects features based on their ability to distinguish between close samples. The Relief algorithm operates efficiently and has no restriction on data types, but is limited to binary classification problems. To overcome the shortcomings of the Relief algorithm, in 1994 Kononenko proposed an improved algorithm, ReliefF¹², which can deal with multi-class classification problems and regression problems, and it is recognized as a filter-type FS algorithm with better results¹³. However, the disadvantage of the Relief series of algorithms is that they cannot remove redundant features. Mitra, Murthy, and Pal (2002) proposed an unsupervised feature subset selection method¹⁴ that utilizes feature similarity to eliminate redundancy among features. They used a new metric called the maximum information compression index to compute the similarity between two random variables for FS. However, all of the above methods lack consideration of the interaction gain between features.

Sparse learning based approaches aim to reduce the fitting error while incorporating sparse regularization terms. Many feature coefficients are forced to be small or exactly zero by the sparse regularization terms, and the corresponding features can then be easily deleted. In 1996, Tibshirani¹⁵ replaced the L2-norm penalty term of ridge regression with an L1-norm penalty term and formally proposed Lasso (Least Absolute Shrinkage and Selection Operator). In 2005, Zou and Hastie proposed the Elastic Net (EN)¹⁶ method to overcome some shortcomings of Lasso. Elastic Net combines the two regularization methods, ridge regression and Lasso regression, and controls the size of the penalty term jointly by the two parameters $\lambda _1$ and $\lambda _2$.
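As an illustration of how the two parameters act together, the (naive) Elastic Net estimate can be written as follows. The symbols $\varvec{y}$ (response vector), $\varvec{X}$ (design matrix) and $\varvec{\beta }$ (coefficient vector) are notation introduced here for the sketch and are not used elsewhere in this paper:

$$\begin{aligned} \hat{\varvec{\beta }}=\underset{\varvec{\beta }}{\arg \min }\left\{ \Vert \varvec{y}-\varvec{X} \varvec{\beta }\Vert _2^2+\lambda _1\Vert \varvec{\beta }\Vert _1+\lambda _2\Vert \varvec{\beta }\Vert _2^2\right\} \end{aligned}$$

Setting $\lambda _2=0$ recovers Lasso and setting $\lambda _1=0$ recovers ridge regression; features whose coefficients are driven to zero by the $\lambda _1$ term are the ones discarded.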

FS based on information-theoretic criteria builds on information entropy and mutual information and has a solid theoretical foundation. Battiti (1994) proposed a mutual information based FS algorithm called MIFS¹⁷. This algorithm considers feature-feature and feature-class mutual information. It uses a greedy strategy to select features to maximize the information about the label. Kwak and Choi (2002) proposed an algorithm called MIFS-U¹⁸ to overcome the limitations of MIFS and obtained better results than MIFS. Peng, Long and Ding (2005)¹⁹ proposed a mutual information based maximum relevance and minimum redundancy (mRMR) approach which maximizes the relevance between features and label and minimizes the redundancy between features. Similarly, Estevez et al. proposed Normalized Mutual Information FS (NMIFS) in 2009²⁰, and Cheng et al. proposed Conditional Mutual Information FS (CMI) in 2011²¹. However, none of the above methods takes into account the fact that mutual information is inclined to select features with more discrete values²², nor do they take into account the interaction gain between features, which makes the results biased. To address these shortcomings, Sun et al. proposed a Dynamic Weighting based FS (DWFS) algorithm in 2013²³, and Zeng et al. proposed an Interaction Weight based FS (IWFS) algorithm in 2015²⁴. These two methods used symmetric uncertainty (SU)²⁵, a normalized form of mutual information, to remedy the first deficiency, but still did not jointly take into account relevance, redundancy and interaction gain. In 2020, Gu et al. made some improvements and proposed a Redundancy Analysis and Interaction Weight based FS algorithm²⁶. However, this method introduced a redundancy coefficient $\alpha$, and determining its value became a new problem. In 2021, Jian Ma proposed a new FS method based on copula entropy²⁷ and released a software package for calculating copula entropy, but only the correlation was considered.
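For reference, the incremental mRMR rule mentioned above is commonly written in the following difference form. The notation is borrowed from later sections of this paper as an assumption of this sketch: $\varOmega _s$ is the set of already selected features, $\varOmega _n$ the set of candidates, and c the label.

$$\begin{aligned} f^{*}=\underset{f \in \varOmega _n}{\arg \max }\left[ I(f ; c)-\frac{1}{\left| \varOmega _s\right| } \sum _{f_s \in \varOmega _s} I\left( f ; f_s\right) \right] \end{aligned}$$

Both terms involve only pairwise mutual information, which is why a criterion of this kind cannot see interaction gain that appears only when three or more variables are considered jointly.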

In recent years, information theory-based feature selection methods have centered on balancing feature-label relevance and reducing feature redundancy, a rationale that is particularly crucial in multi-label scenarios. Early approaches often neglected the joint information between candidate and selected features. In 2019, Zhang et al. integrated feature relevance and feature redundancy terms to design a multi-label feature selection method named Label Redundancy-based Feature Selection (LRFS)²⁸. In 2021, the method named Feature Selection considering Conditional-Weight Joint Relevance (CWJR-FS)²⁹, proposed by Zhang et al., improved upon this by integrating joint mutual information and introducing conditional weights to filter redundancy, thereby effectively capturing the comprehensive classification information among features. Addressing the specificity of multilabel scenarios, Zhang and Gao put forward the method Double Conditional Relevance-Multilabel Feature Selection (DCR-MFS) in 2021³⁰, incorporating the influence of selected features and label associations through the “dual conditional relevance” metric. The Relevance based on Weight Feature Selection (RWFS) method, proposed by Hu et al. in 2022³¹, further integrates the information change rate to more comprehensively evaluate the importance of features to the label set. Additionally, Li et al. proposed a different approach in their Robust Flexible Sparse Regularized Multilabel Feature Selection (RFSFS) method in 2023³², which balances label-specific features and redundancy control through robust flexible sparse regularization, thus providing a regularization perspective for solutions. In 2024, Han et al. proposed the RRFS method, focusing on more complex multiview multilabel scenarios³³. It uses dual coefficients to balance label information gain and view redundancy, respectively, thereby adapting to complex data structures. These studies have advanced the development of multi-label feature selection from various perspectives, offering important references for subsequent research.

Theoretical foundations of information theory

Problem formulation and symbols

To describe the algorithms in this paper, we use symbols and notations given in Table 1. For a given dataset D of dimension d, with a feature set $F=\left\{ f_1, f_2, \ldots , f_d\right\}$, the problem is to select an optimal subset of relevant features $F^{\prime }$ where (i) $F^{\prime } \subseteq F$ and (ii) for $F^{\prime }$, a classifier gives the best possible classification accuracy. In other words, we aim to identify a subset of features which satisfies the condition that the feature–feature mutual information is minimum and feature–class mutual information is maximum.

The proposed method uses copula entropy to compute mutual information and select features. The method combines feature-class correlation, feature-feature redundancy and interaction gain between features.

Table 1 Symbols used in our method.


Information entropy

In 1948, Shannon introduced the concept of information theory³⁴, which provides an effective way to measure the information between random variables. Entropy can be understood as a measure of the uncertainty of a random variable. The greater the entropy of a random variable, the greater the uncertainty. Let $X=\left\{ x_1, x_2, \ldots , x_l\right\}$ be a discrete random variable with probability distribution $p(x)=P(X=x)$, $x \in X$. Then the entropy of X is defined as:

$$\begin{aligned} H(X)= -\sum _{i = 1}^l p(x_i) \log p(x_i) \end{aligned}$$

(1)

where $p(x_i)$ is the probability of $X=x_i$. Here log has base 2 and, by convention, $0 \log 0 = 0$.

Let $Y=\left\{ y_1, y_2, \ldots , y_m\right\}$ also be a discrete random variable and let $p(x_i,y_j)$ be the joint probability of X and Y. Then their joint entropy is defined as:

$$\begin{aligned} H(X, Y)= -\sum _{i = 1}^l\sum _{j = 1}^m p(x_i, y_j)\log p(x_i, y_j) \end{aligned}$$

(2)

If a random variable X is given, then the measure of uncertainty with respect to a random variable Y is defined as the conditional entropy, which is given as follows:

$$\begin{aligned} H(Y \mid X) = H(X, Y)-H(X) = -\sum _{i=1}^l \sum _{j=1}^m p\left( x_i, y_j\right) \log p\left( y_j \mid x_i\right) \end{aligned}$$

(3)

where $p\left( y_j \mid x_i\right)$ is the conditional probability of $Y=y_j$ under the given condition $X=x_i$.
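Eqs. (1) to (3) can be checked numerically on a small sample. The following is a minimal sketch that estimates the probabilities by relative frequencies; the function names and the toy data are illustrative assumptions, not part of the proposed method.

```python
from collections import Counter
from math import log2


def entropy(samples):
    """Empirical Shannon entropy in bits, Eq. (1), from a list of outcomes."""
    n = len(samples)
    # Outcomes that never occur are absent from the Counter, which matches
    # the convention 0 log 0 = 0.
    return -sum((k / n) * log2(k / n) for k in Counter(samples).values())


def joint_entropy(xs, ys):
    """Eq. (2): entropy of the paired outcomes (x_i, y_j)."""
    return entropy(list(zip(xs, ys)))


def conditional_entropy(ys, xs):
    """Eq. (3): H(Y | X) = H(X, Y) - H(X)."""
    return joint_entropy(xs, ys) - entropy(xs)


# Hypothetical toy samples: x and y behave like independent fair bits.
x = [0, 0, 1, 1]
y = [0, 1, 0, 1]
h_x = entropy(x)                          # one fair bit
h_xy = joint_entropy(x, y)                # two independent fair bits
h_y_given_x = conditional_entropy(y, x)   # knowing x does not help
h_x_given_x = conditional_entropy(x, x)   # knowing x removes all uncertainty
```

In this toy case knowing X leaves the uncertainty about Y unchanged, while conditioning a variable on itself leaves no uncertainty at all.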

Mutual information

In information theory, mutual information (MI) I(X; Y) is the reduction in the uncertainty of X due to the knowledge of Y³⁵. Mathematically, mutual information is defined as:

$$\begin{aligned} I(X ; Y)=\sum _{i=1}^l \sum _{j=1}^m p\left( x_i, y_j\right) \log \frac{p\left( x_i, y_j\right) }{p\left( x_i\right) p\left( y_j\right) } \end{aligned}$$

(4)

We can also say:

$$\begin{aligned} I(X ; Y)=H(X)-H(X \mid Y) \end{aligned}$$

(5)

This substantiates the intuitive meaning of mutual information as the amount of information that knowing either variable provides about the other. According to Eq. (3), we can also write it as:

$$\begin{aligned} I(X ; Y)=H(X)+H(Y)-H(X, Y) \end{aligned}$$

(6)
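The agreement between the definition in Eq. (4) and the entropy form in Eq. (6) can be verified on frequency estimates. This is a small sketch with hypothetical data and helper names chosen only for illustration.

```python
from collections import Counter
from math import log2


def entropy(samples):
    """Empirical Shannon entropy in bits."""
    n = len(samples)
    return -sum((k / n) * log2(k / n) for k in Counter(samples).values())


def mi_from_definition(xs, ys):
    """Eq. (4): sum of p(x, y) * log(p(x, y) / (p(x) p(y))) over observed pairs."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # With counts k, px[a], py[b], the ratio p(x,y)/(p(x)p(y)) is k*n/(px*py).
    return sum((k / n) * log2(k * n / (px[a] * py[b]))
               for (a, b), k in pxy.items())


def mi_from_entropies(xs, ys):
    """Eq. (6): I(X; Y) = H(X) + H(Y) - H(X, Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


# Hypothetical toy samples with partial dependence between x and y.
x = [0, 0, 0, 1, 1, 1, 1, 0]
y = [0, 0, 1, 1, 1, 1, 0, 0]
mi_a = mi_from_definition(x, y)
mi_b = mi_from_entropies(x, y)
```

The two functions agree up to floating-point error, and the mutual information of a variable with itself equals its entropy.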

Multivariate mutual information, as a measure of the mutual dependence of several random variables, is defined as follows:

$$\begin{aligned} I(\varvec{z})=\sum _{i=1}^n H\left( z_i\right) -H(\varvec{z}) \end{aligned}$$

(7)

where $\varvec{z}=\left[ z_1, \cdots , z_n\right]$ are random variables. Unlike traditional statistical measures of dependence, such as correlation (i.e. second-order statistics) and higher-order statistics³⁶, I is a measure of all orders of dependence in this equation.
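For discrete samples, Eq. (7) can be evaluated directly from frequency estimates. The sketch below uses hypothetical helper names and toy data; it only illustrates the definition and is not the copula-based estimator used later in the paper.

```python
from collections import Counter
from math import log2


def entropy(samples):
    """Empirical Shannon entropy in bits."""
    n = len(samples)
    return -sum((k / n) * log2(k / n) for k in Counter(samples).values())


def multivariate_mi(columns):
    """Eq. (7): I(z) = sum_i H(z_i) - H(z) for a list of equal-length columns."""
    joint = list(zip(*columns))
    return sum(entropy(col) for col in columns) - entropy(joint)


# Hypothetical toy data.
bit = [0, 1, 0, 1]
three_copies = multivariate_mi([bit, bit, bit])            # fully dependent
independent = multivariate_mi([[0, 0, 1, 1], [0, 1, 0, 1]])  # no dependence
```

Three copies of the same fair bit share two bits of redundant information in total, while independent columns give zero.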

Relevance, redundancy and interaction

Relevance concerns a feature and the label. John et al. categorized features into three disjoint classes: strongly correlated, weakly correlated, and irrelevant¹³. If the label is c, these correlation types can be expressed as follows:

Definition 1

(strong correlation) A feature $f(f\in F)$ is considered strongly correlated if and only if:

$$\begin{aligned} P(c \mid {F}) \ne P\left( c \mid F-\left\{ f\right\} \right) \end{aligned}$$

(8)

Definition 2

(weak correlation) A feature f is weakly correlated if $P(c \mid F) = P\left( c \mid F-\left\{ f\right\} \right)$ and there exists $S \subseteq F-\left\{ f\right\}$ that satisfies the following condition:

$$\begin{aligned} P\left( c \mid f, S\right)&\ne P(c \mid S) \end{aligned}$$

(9)

Definition 3

(irrelevance) A feature f is considered irrelevant if every $S \subseteq F-\left\{ f\right\}$ satisfies the following condition:

$$\begin{aligned} P\left( c \mid f, S\right)&= P(c \mid S) \end{aligned}$$

(10)

Redundancy concerns pairs of features. Yu and Liu proposed the definition of redundancy based on the concept of the Markov blanket³⁷.

Definition 4

(redundancy) A feature f is redundant if it is weakly correlated and has a Markov blanket in the feature set F.

Interaction also concerns pairs of features. One explanation of interaction provided by McGill³⁸ is as follows: interaction is the amount of information that is increased (or decreased) by attribute f when other attributes are known. On this basis, we can give the definition of interaction:

Definition 5

(interaction) Let $F_a \subseteq F$ and $F_b = F-F_a$. There is an interaction between $F_a$ and $F_b$ if:

$$\begin{aligned} I(F ; c) > I\left( F_a ; c\right) + I\left( F_b ; c\right) \end{aligned}$$

(11)

This indicates that the information of $F_a$ and $F_b$ as a whole is greater than the sum of the information each provides individually. Interactions are common in gene expression, where multiple genes often work together to determine a trait.

Taking Table 2 as an example, both $f_a$ and $f_b$ are strongly correlated with the label c, and $f_a$ and $f_b$ are redundant with each other. $f_c$ is unrelated to the label c (in reality a feature is rarely completely unrelated; here $f_c$ is a constant for illustration). Viewed in isolation, $f_d$ and $f_e$ do not give us any useful information about c. However, when $f_d$ and $f_e$ are taken as a whole, we find that whether they differ determines c. Therefore, $f_d$ and $f_e$ interact, and both $f_d$ and $f_e$ are weakly related to the label c.
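The interaction condition in Eq. (11) can be illustrated with a label that equals the exclusive-or of two binary features. The data below are hypothetical and are not the values of Table 2; the helper names are likewise only illustrative.

```python
from collections import Counter
from math import log2


def entropy(samples):
    """Empirical Shannon entropy in bits."""
    n = len(samples)
    return -sum((k / n) * log2(k / n) for k in Counter(samples).values())


def mutual_information(xs, ys):
    """I(X; Y) = H(X) + H(Y) - H(X, Y); xs may hold tuples for a feature set."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


# Hypothetical features: the label is 1 exactly when f_d and f_e differ.
f_d = [0, 0, 1, 1]
f_e = [0, 1, 0, 1]
c = [a ^ b for a, b in zip(f_d, f_e)]

mi_d = mutual_information(f_d, c)                     # f_d alone
mi_e = mutual_information(f_e, c)                     # f_e alone
mi_joint = mutual_information(list(zip(f_d, f_e)), c)  # both together
```

Each feature alone carries no information about the label, yet the pair determines it completely, so the joint information exceeds the sum of the individual contributions.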

Table 2 An example of features with relevance, redundancy, and interaction.


Copula entropy

The entropy of the copula, named the copula entropy, is defined as a measure of the dependence uncertainty represented by the copula function.

Definition 6

(copula entropy) Let $\varvec{z}=\left[ z_1, \cdots , z_n\right]$ be random variables with marginal functions $\varvec{u}=\left[ \varPhi _1, \cdots , \varPhi _n\right]$ and copula density $c(\varvec{u})$. Copula entropy of $\varvec{z}$ is defined as:

$$\begin{aligned} H_{\textrm{c}}(\varvec{z})=-\int _u c(\varvec{u}) \log c(\varvec{u}) \textrm{d} \varvec{u} \end{aligned}$$

(12)

where $c(\varvec{u})=\frac{\textrm{d}^n C(\varvec{u})}{\textrm{d} u_1 \mathrm {~d} u_2 \cdots \textrm{d} u_n}$, $P(\varvec{z})=C\left( \varPhi _1\left( z_1\right) , \cdots , \varPhi _n\left( z_n\right) \right)$, C is copula function, and P is a joint distribution function with margins $\left\{ \varPhi _1, \varPhi _2, \ldots , \varPhi _n\right\}$.
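When the margins $\varPhi _i$ are unknown, a common nonparametric choice is to replace each of them by its empirical distribution function, that is, by scaled ranks. The sketch below shows this transform under the assumption of distinct sample values; the function name is hypothetical and this is not claimed to be the estimator used in this paper.

```python
from math import exp


def empirical_copula_transform(column):
    """Map each value to rank / n in (0, 1]; assumes all values are distinct."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    u = [0.0] * n
    for rank, i in enumerate(order, start=1):
        u[i] = rank / n
    return u


# Hypothetical sample of one variable.
z = [2.5, -1.0, 0.3, 7.1]
u = empirical_copula_transform(z)

# A strictly increasing transformation of the data leaves the ranks unchanged.
u_exp = empirical_copula_transform([exp(v) for v in z])
```

Because the transformed values depend only on ranks, any quantity computed from them is unaffected by strictly increasing rescalings of the individual variables, which is the sense in which the copula isolates the dependence structure from the margins.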

In 2011, Jian Ma et al. theoretically proved the equivalence between copula entropy and the concept of mutual information³⁹.

Theorem 1

The mutual information of random variables is equivalent to their negative copula entropy:

$$\begin{aligned} I(\varvec{z})=-H_{\textrm{c}}(\varvec{z}) \end{aligned}$$

(13)

Proof

$$\begin{aligned} I(\varvec{z})&= \int _{\varvec{z}} p(\varvec{z}) \log \frac{p(\varvec{z})}{\prod _i p_i(z_i)} \, \textrm{d} \varvec{z} \end{aligned}$$

(14)

$$\begin{aligned}&= \int _{\varvec{z}} c(\varvec{u}_{\varvec{z}}) \prod _i p_i(z_i) \log c(\varvec{u}_{\varvec{z}}) \, \textrm{d} \varvec{z} \end{aligned}$$

(15)

$$\begin{aligned}&= \int _{\varvec{z}} c(\varvec{u}_{\varvec{z}}) \log c(\varvec{u}_{\varvec{z}}) \, \textrm{d} \varvec{u}_{\varvec{z}} \end{aligned}$$

(16)

$$\begin{aligned}&= -H_{\textrm{c}}(\varvec{z}). \end{aligned}$$

(17)

Theorem 1 can be interpreted as mutual information exactly measuring the dependence uncertainty of random variables.
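As a worked instance of Theorem 1, consider the standard case of two jointly Gaussian variables $z_1$ and $z_2$ with correlation coefficient $\rho$ (an example added here for illustration). Their mutual information has a closed form, so the copula entropy follows directly from Eq. (13):

$$\begin{aligned} I(z_1 ; z_2)=-\frac{1}{2} \log \left( 1-\rho ^2\right) , \qquad H_{\textrm{c}}(z_1, z_2)=\frac{1}{2} \log \left( 1-\rho ^2\right) \le 0 \end{aligned}$$

For $\rho =0$ both quantities are zero, and for $\rho =0.8$ with base-2 logarithms the mutual information is about 0.737 bits. Copula entropy is therefore non-positive, and it becomes more negative as the dependence strengthens.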

As we know, the definition of mutual information proposed by Shannon is for a pair of random variables. Copula entropy, however, inherits the property that the copula contains all the correlation information of the random variables, so it is not limited to a pair of variables and also applies to the multivariate case. Copula entropy can be used to measure the full-order correlation between multiple random variables⁴⁰, which provides us with a basis for calculating the multivariate mutual information.

In this paper, we mainly utilize copula entropy for multivariate correlation measures.

Proposed method

Derivation of the CEFS+ method

Here, let’s go back to the essence of FS. We aim to select as few features as possible to obtain as much information about the label as possible. In practice, we adopt the forward search strategy of greedy selection. First, we make the selected feature set $\varOmega _s$ an empty set. Then we select the most important feature $f^*$ from the to-be-selected feature set $\varOmega _n$ according to some selection criteria, add $f^*$ to the selected feature set $\varOmega _s$ and remove it from the to-be-selected feature set $\varOmega _n$. Repeat this step until the termination condition is satisfied. Finally, we get a list of features from high priority to low priority, which marks the completion of FS.
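The forward search described above can be sketched as a generic loop with a pluggable selection criterion. The `score` callback, the toy values and the fixed budget `k` used as termination condition are assumptions made for illustration only.

```python
def greedy_forward_selection(features, score, k):
    """Return up to k features in the order they were selected.

    score(selected, candidate) -> float; a higher value means a better candidate
    given the features selected so far.
    """
    selected = []                 # plays the role of Omega_s, initially empty
    remaining = list(features)    # plays the role of Omega_n
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: score(selected, f))
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical criterion: an individual value minus a penalty when the
# candidate is redundant with something already selected.
value = {"a": 3.0, "b": 2.9, "c": 1.0}
redundant_with = {"a": {"b"}, "b": {"a"}, "c": set()}


def toy_score(selected, candidate):
    penalty = 2.0 if any(s in redundant_with[candidate] for s in selected) else 0.0
    return value[candidate] - penalty


ranking = greedy_forward_selection(["a", "b", "c"], toy_score, k=3)
```

In this toy run the second-best individual feature is postponed to the last round because it is redundant with the first pick, which is the behavior a good criterion should produce.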

When selecting a feature from the set of to-be-selected features $\varOmega _n$, we would like the selected feature $f^{\circ }$ ($f^{\circ } \in \varOmega _n$) to combine organically with the previously selected features after it is added to the set of already-selected features $\varOmega _s$. In other words, taking $f^{\circ }$ and $\varOmega _s$ as a whole $\varOmega _s^{\circ }$, this whole should contain the most information about the label c. Thus we get:

$$\begin{aligned} f^*=\underset{f^{\circ } \in \varOmega _n}{\arg \max }\, MI(\varOmega _s^{\circ } \sim c) \end{aligned}$$

(18)

where $MI(\varOmega _s^{\circ } \sim c)$ denotes the mutual information between the set $\varOmega _s^{\circ }$ and the label c. It should be noted that the above equation takes into account relevance, redundancy and interaction. When a feature $f^{\circ }$ is added, $f^{\circ }$ should be organically combined with the previously selected features, and the whole contains the most information about the label c. This embodies both relevance and interaction. When a redundant feature $f^{\circ }$ is added, then $f^{\circ }$ cannot combine well with the previously selected features, the value of $MI(\varOmega _s^{\circ } \sim c)$ is low, and $f^{\circ }$ will not be selected in the actual process, which reflects redundancy.

According to Eq. (5), we have:

$$\begin{aligned} MI( \Omega _s^{\circ } \sim c)=H(c)-H\left( c \mid \Omega _s^{\circ } \right) \end{aligned}$$

(19)

On the other hand, we find that when the label c and the set $\varOmega _s^{\circ }$ are taken as a whole $\varOmega _{sc}^{\circ }$, the multivariate mutual information $I(\varOmega _{sc}^{\circ })$ contains all the relations between the individual variables, and then $MI(\varOmega _s^{\circ } \sim c)$ is supposed to be a part of $I(\varOmega _{sc}^{\circ })$. So we get:

Theorem 2

(divisibility of multiple mutual information) The information in the variable set pointing to a variable is equal to all the information minus the information in the variable set.

$$\begin{aligned} MI( \Omega _s^{\circ } \sim c) =\textit{I}\left( \varOmega _{sc}^{\circ }\right) -I\left( \varOmega _{s}^{\circ }\right) \end{aligned}$$

(20)

Proof

$$\begin{aligned} MI( \Omega _s^{\circ } \sim c)&=H(c)-H\left( c \mid \Omega _s^{\circ } \right) \\&=\left[ \sum _{f_i \in \varOmega _{sc}^{\circ }} H\left( f_i\right) -\sum _{f_i \in \varOmega _{s}^{\circ }} H\left( f_i\right) \right] -\left[ H\left( \varOmega _{s}^{\circ }, c\right) -H\left( \varOmega _{s}^{\circ }\right) \right] \\&=\left[ \sum _{f_i \in \varOmega _{sc}^{\circ }} H\left( f_i\right) -H\left( \varOmega _{sc}^{\circ }\right) \right] -\left[ \sum _{f_i \in \varOmega _{s}^{\circ }} H\left( f_i\right) -H\left( \varOmega _{s}^{\circ }\right) \right] \\&=I\left( \varOmega _{sc}^{\circ }\right) -I\left( \varOmega _{s}^{\circ }\right) . \end{aligned}$$

which shows that the information in the set $\varOmega _{sc}^{\circ }$ can be divided into two parts, one part of it is the information about c, and the other part of it is the other information (which no longer contains any information about c).
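Theorem 2 can be checked numerically on discrete data by computing both sides of Eq. (20) from frequency estimates. The sketch below uses hypothetical features and helper names; it relies on plain Shannon entropies rather than copula entropy, so it only illustrates the identity.

```python
from collections import Counter
from math import log2


def entropy(samples):
    """Empirical Shannon entropy in bits."""
    n = len(samples)
    return -sum((k / n) * log2(k / n) for k in Counter(samples).values())


def multivariate_mi(columns):
    """Eq. (7): sum of marginal entropies minus the joint entropy."""
    return sum(entropy(col) for col in columns) - entropy(list(zip(*columns)))


def set_to_label_mi(columns, label):
    """MI(Omega ~ c) = H(Omega) + H(c) - H(Omega, c), equivalent to Eq. (19)."""
    joint_features = list(zip(*columns))
    joint_all = list(zip(*columns, label))
    return entropy(joint_features) + entropy(label) - entropy(joint_all)


# Hypothetical data: two binary features and a label equal to their logical OR.
f1 = [0, 0, 1, 1, 0, 1, 0, 1]
f2 = [0, 1, 0, 1, 1, 1, 0, 0]
c = [a | b for a, b in zip(f1, f2)]

lhs = set_to_label_mi([f1, f2], c)                            # left side of Eq. (20)
rhs = multivariate_mi([f1, f2, c]) - multivariate_mi([f1, f2])  # right side
```

Both sides agree up to floating-point error, as the cancellation of the marginal entropies in the proof predicts.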

It is also worth mentioning that Eq. (20) is exactly the full-order representation of the mRMR algorithm, taking into account the interaction. Based on Eq. (13), we can further write Eq. (20) as:

$$\begin{aligned} MI( \Omega _s^{\circ } \sim c) ={H_c}\left( \varOmega _{s}^{\circ }\right) -H_c\left( \varOmega _{sc}^{\circ }\right) \end{aligned}$$

(21)

Letting $J(f^{\circ })={H_c}\left( \varOmega _{s}^{\circ }\right) -H_c\left( \varOmega _{sc}^{\circ }\right)$, we get the selection criterion for our method:

$$\begin{aligned} f^*=\underset{f^{\circ } \in \varOmega _n}{\arg \max }\, J(f^{\circ }) \end{aligned}$$

(22)

Implementation of CEFS+ method

Initially, we start with initialization. Let the feature set to be selected be the original feature set and the selected feature set be the empty set. Then, we use copula entropy to calculate the mutual information of each feature and class (MIFC), and select the feature with the largest mutual information as the main effect feature, which is put into the selected feature set and removed from the to-be-selected feature set.

Next, for each candidate feature, we calculate ${H_c}\left( \varOmega _{s}^{\circ }\right)$ and $H_c\left( \varOmega _{sc}^{\circ }\right)$. At this point we could directly use ${H_c}\left( \varOmega _{s}^{\circ }\right) -H_c\left( \varOmega _{sc}^{\circ }\right)$ and select the feature with the largest value as the feature $f^*$ for the current round, putting it into the selected feature set and deleting it from the candidate feature set (we refer to this method as CEFS; its results are shown in the following section). In practice, however, the variability of the label and the features can cause small deviations in the fluctuation amplitudes of ${H_c}\left( \varOmega _{s}^{\circ }\right)$ and $H_c\left( \varOmega _{sc}^{\circ }\right)$, which makes feature selection less effective in the early rounds. We therefore propose an improved version of CEFS, called CEFS+. Here we use the NSGA-II (Non-dominated Sorting Genetic Algorithm II)⁴¹ optimization algorithm to select the feature $f^*$. After calculating ${H_c}\left( \varOmega _{s}^{\circ }\right)$ and $H_c\left( \varOmega _{sc}^{\circ }\right)$ for each candidate feature, we obtain the ranks $R_f$ and $R_c$ (called domination counts) corresponding to ${H_c}\left( \varOmega _{s}^{\circ }\right)$ and $H_c\left( \varOmega _{sc}^{\circ }\right)$, respectively. We then replace ${H_c}\left( \varOmega _{s}^{\circ }\right) -H_c\left( \varOmega _{sc}^{\circ }\right)$ with $R_f-R_c$ and select the feature with the largest value of $R_f-R_c$ as $f^*$. This is similar to a normalization, since both quantities are placed on the same rank scale. In the case of a tie (two features with the same value of $R_f-R_c$), the method selects the feature with the higher mutual information between feature and label.

Repeat the steps in the previous paragraph until the termination condition is met.
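The procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' released implementation: the names `cefs_plus`, `dominance_count` and `copula_entropy` are hypothetical; copula entropy is estimated with a Gaussian-copula stand-in; the rank of a value is assumed to be the number of candidates with a strictly smaller value; and the termination condition is assumed to be reaching `k` selected features.

```python
from statistics import NormalDist

import numpy as np

_INV_CDF = np.vectorize(NormalDist().inv_cdf)


def _scores(x):
    """Normal scores from average ranks."""
    _, inv, cnt = np.unique(x, return_inverse=True, return_counts=True)
    ranks = (np.cumsum(cnt) - (cnt - 1) / 2.0)[inv]
    return _INV_CDF(ranks / (len(x) + 1.0))


def copula_entropy(columns):
    """Stand-in estimator (assumption): Gaussian copula, 0.5 * log det(R)."""
    if len(columns) == 1:
        return 0.0
    z = np.column_stack([_scores(c) for c in columns])
    return 0.5 * np.linalg.slogdet(np.corrcoef(z, rowvar=False))[1]


def dominance_count(values):
    """For each entry, count how many entries are strictly smaller."""
    v = np.asarray(values)
    return (v[:, None] > v[None, :]).sum(axis=1)


def cefs_plus(X, y, k, use_ranks=True):
    """Greedy selection; use_ranks=False gives the plain CEFS criterion."""
    n_features = X.shape[1]
    # Mutual information of each feature with the class: MIFC = -H_c({f, c}).
    mifc = np.array([-copula_entropy([X[:, j], y]) for j in range(n_features)])
    selected = [int(np.argmax(mifc))]  # main-effect feature
    remaining = [j for j in range(n_features) if j != selected[0]]
    while len(selected) < k and remaining:
        base = [X[:, j] for j in selected]
        h_s = np.array([copula_entropy(base + [X[:, j]]) for j in remaining])
        h_sc = np.array([copula_entropy(base + [X[:, j], y]) for j in remaining])
        if use_ranks:
            score = dominance_count(h_s) - dominance_count(h_sc)  # R_f - R_c
        else:
            score = h_s - h_sc
        # Largest score wins; ties are broken by the larger MIFC.
        best = max(range(len(remaining)),
                   key=lambda i: (score[i], mifc[remaining[i]]))
        selected.append(remaining.pop(best))
    return selected


# Toy data: the label depends on features 0 and 3 only.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
y = (X[:, 0] + X[:, 3] + 0.2 * rng.normal(size=300) > 0).astype(int)
selected = cefs_plus(X, y, k=3)
print(selected)
```

Each round evaluates every remaining candidate independently, which is why the inner computations are straightforward to run in parallel.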

Example 1

Let $F=\left\{ f_1, f_2, f_3, f_4, f_5\right\}$ be a set of five features. First, we set the candidate feature set $\varOmega _n$ equal to F, compute the MIFC of every feature by copula entropy, and select the feature with the maximum mutual information value. Assume that feature $f_3$ has the highest mutual information value; $f_3$ is then removed from $\varOmega _n$ and put in the optimal feature set, say $\varOmega _s$. Next, for each feature $f_j \in \varOmega _n$, we compute ${H_c}\left( \varOmega _{s}^{\circ }\right)$ and $H_c\left( \varOmega _{sc}^{\circ }\right)$. We do this for $f_1$, $f_2$, $f_4$ and $f_5$, and then compute the corresponding ranks $R_f$ and $R_c$. Consider the scenario shown in Table 3. Here, feature $f_2$ has the maximum difference between $R_f$ and $R_c$, i.e., 3. Hence feature $f_2$ is selected.

Table 3 An example of the domination counts of features ($R_f$ and $R_c$ are domination counts).

Example 2

If several features attain the maximum value of $R_f-R_c$, we select among them the feature with the highest MIFC. For example, in the case shown in Table 4, $f_1$ and $f_5$ share the maximum $R_f-R_c$ value, but the MIFC value of $f_5$ is higher, so we choose $f_5$.

Table 4 An example of a tie in the domination counts of features ($R_f$ and $R_c$ are domination counts. MIFC is the mutual information of feature and class).
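The tie-breaking rule of Example 2 amounts to comparing candidates by the pair ($R_f-R_c$, MIFC). The short sketch below uses hypothetical numbers chosen only to reproduce the situation described in the text (they are not the values of Table 4), and the function name `pick_feature` is likewise an assumption.

```python
# Hypothetical domination counts and MIFC values for four candidates.
candidates = {
    "f1": {"R_f": 3, "R_c": 2, "mifc": 0.21},
    "f2": {"R_f": 1, "R_c": 3, "mifc": 0.30},
    "f4": {"R_f": 0, "R_c": 0, "mifc": 0.12},
    "f5": {"R_f": 2, "R_c": 1, "mifc": 0.27},
}


def pick_feature(cands):
    """Largest R_f - R_c wins; a tie is resolved by the larger MIFC."""
    return max(
        cands,
        key=lambda name: (cands[name]["R_f"] - cands[name]["R_c"],
                          cands[name]["mifc"]),
    )


chosen = pick_feature(candidates)
print(chosen)  # f1 and f5 tie on R_f - R_c = 1; f5 has the higher MIFC
```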

The flowchart of the CEFS+ algorithm is shown in Fig. 1.

Complexity analysis

The overall complexity of the proposed method depends on the dimensionality of the input dataset and the number of features to be selected. For a d-dimensional dataset from which k features are to be selected, the computational complexity of our method is O(kd). Typically, k is much smaller than d. In practice, when traversing the candidate feature set to compute ${H_c}\left( \varOmega _{s}^{\circ }\right)$ and ${H_c}\left( \varOmega _{sc}^{\circ }\right)$ for each feature, we use a parallelization strategy.
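Counting criterion evaluations makes the O(kd) bound concrete. The count below is a sketch under two assumptions: each round evaluates every remaining candidate once, and a single copula entropy estimate is treated as unit cost (the cost of one estimate, which grows with the sample size and with the size of the current subset, is not included). The first round needs one estimate per feature for the MIFC, and each later round $t$ needs two estimates for each of the $d-t$ remaining candidates:

$$\begin{aligned} N_{\text{eval}}&=d+\sum _{t=1}^{k-1} 2\,(d-t) \\&=d+2(k-1)d-k(k-1) \\&=O(kd). \end{aligned}$$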

Experimental result

The experiments were conducted on a workstation with 80 GB of RAM, 8 × Intel(R) Xeon(R) Platinum 8358P CPU @ 2.60 GHz and the 64-bit Ubuntu 20.04 operating system. We used Python 3.8 to implement our algorithm.

Dataset description

In our experiments, we use both nonbiological and biological datasets. The nonbiological datasets are Sonar⁴² and TUANDROMD, which have a small number of features; we use them to observe the performance of the proposed method in the low-dimensional case. The biological datasets are arcene, BRCA⁴³ and pancan_tcga^44,45,46, which have a large number of features; we use them to observe the performance of the proposed method in the high-dimensional case. Table 5 gives descriptive information on the datasets.

Table 5 Dataset description (Sonar and TUANDROMD are conventional datasets. Arcene, BRCA and pancan_tcga are high-dimensional gene datasets).

Results and discussions

First, we select a subset of features from each dataset using the proposed method (CEFS+) and 7 comparison methods: CEFS (the basic version of CEFS+), MRMR, JMI⁴⁷, JMIM⁴⁸, Lasso, EN and ReliefF. For each dataset, this yields 8 lists of features ranked from highest to lowest priority. To evaluate the performance of each FS method more objectively, we chose three commonly used classifiers: Random Forest (RF)⁴⁹, Support Vector Machine (SVM)⁵⁰ and LightGBM (lgb)⁵¹. We ran 10-fold cross-validation 10 times to evaluate the selected features on each classifier, with the predicted classification accuracy (hereafter referred to as accuracy) as the final evaluation metric. For all datasets, the highest accuracy achieved by each feature selection algorithm is shown in Table 6.
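The evaluation protocol can be sketched with scikit-learn as follows. This is an illustration under assumptions, not the authors' evaluation script: the function name `accuracy_curve` is hypothetical, stratified folds are assumed (the text only specifies 10-fold cross-validation repeated 10 times), the data are synthetic, and the feature ranking is a placeholder standing in for the output of an FS method.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import SVC


def accuracy_curve(X, y, ranking, classifier, n_splits=10, n_repeats=10, seed=0):
    """Mean accuracy on the top-s ranked features, for s = 1..len(ranking)."""
    cv = RepeatedStratifiedKFold(
        n_splits=n_splits, n_repeats=n_repeats, random_state=seed
    )
    curve = []
    for s in range(1, len(ranking) + 1):
        scores = cross_val_score(
            classifier, X[:, ranking[:s]], y, cv=cv, scoring="accuracy"
        )
        curve.append(scores.mean())
    return np.array(curve)


# Synthetic stand-in data; a real run would load one of the five datasets.
X, y = make_classification(
    n_samples=200, n_features=12, n_informative=4, random_state=0
)
ranking = list(range(5))  # placeholder ranking, not produced by any FS method
curve = accuracy_curve(X, y, ranking, SVC())

best_s = int(np.argmax(curve)) + 1  # number of features at maximum accuracy
print(best_s, curve.max())
```

The maximum of the curve and the value of s at which it occurs correspond to the two quantities reported per method and classifier in the performance comparison.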

Table 6 Performance comparison (max-accuracy and the number of selected features for max-accuracy) of eight feature selection methods: CEFS+ (proposed method), CEFS (the basic version of CEFS+), MRMR, JMI, JMIM, Lasso, EN and ReliefF (bold values indicate the best performance achieved on a given dataset among the eight methods).

The results in Fig. 2(a-e) show that our proposed method performs well in most scenarios. Of the five datasets, our method obtains the highest accuracy on three, and on the remaining two its accuracy is close to the best. Specifically, as shown in Fig. 3, out of 15 experiments, our proposed method obtains the highest accuracy 10 times and the second highest accuracy 3 times, ranking first. The basic version of the proposed method, CEFS, obtains the highest accuracy 2 times and the second highest accuracy 5 times, ranking second. The Lasso method obtains the highest accuracy 2 times and the second highest accuracy 4 times, ranking third. Overall, CEFS also performs well, but the improved method, CEFS+, is superior. The performance of CEFS+ on high-dimensional datasets is particularly strong. On the arcene dataset, the SVM accuracy of the proposed method is 3.25% higher than that of the second-place Lasso, and the lgb accuracy is 1.2% higher than that of the second-place JMIM. On the BRCA dataset, the RF accuracy of CEFS+ is 0.88% higher and the lgb accuracy is 0.4% higher than those of the second-place JMIM. On the pancan_tcga dataset, the RF, SVM and lgb accuracies of CEFS+ are 1.57%, 3.36% and 2.15% higher, respectively, than those of the second-place Lasso.

To analyze the feature selection more carefully, we take CEFS+, CEFS and the third-place Lasso, and plot classification accuracy against the number of selected features s. In the following, we analyze the curves with respect to four aspects: trend, stability, maximum value, and the number of features s at the maximum value.

On the Sonar dataset, since the number of original features is small, we select all the features step by step for easy observation. Fig. 4 shows that on the RF classifier, CEFS+ performs better than CEFS and slightly worse than Lasso. On the SVM classifier, CEFS+ has the highest accuracy of 85% (at s=25) and performs well in the early stage (when s $\le$ 25); in the later stage, the accuracy of all three methods decreases, which also shows the necessity of feature selection. On the lgb classifier, CEFS+ shows an overall upward trend; it is not as good as Lasso in the early stage (when s $\le$ 50) but reaches the highest accuracy of 89.28% (when s=51). Overall, on the Sonar dataset, the proposed method performs slightly worse than Lasso but better than CEFS.

On the TUANDROMD dataset, we select 100 features step by step. Fig. 4 shows that on the RF classifier, CEFS+ outperforms CEFS and Lasso and has the highest accuracy of 98.48% (at s=21). On the SVM classifier, CEFS+ does not perform better than Lasso, but its overall trend is more stable and increasing. On the lgb classifier, CEFS+ fails to beat Lasso and CEFS in the early stage (when s $\le$ 56), then gradually rises in the later stage and reaches the highest accuracy of 97.85% (when s=87). It is worth noting that on all three classifiers, CEFS performs more weakly in the early stage (when s $\le$ 19).

On the arcene dataset, because the number of samples is small, we select a relatively small number of features: 50 features, step by step. Fig. 4 shows that on the RF classifier, CEFS+ performs better than CEFS and Lasso. On the SVM classifier, the proposed method significantly outperforms Lasso and CEFS and reaches the highest accuracy of 96.5% when s=35. The proposed method is also better than CEFS and Lasso on the lgb classifier, with the highest accuracy of 95.6% (when s=25). Overall, the proposed method performs much better.

On the BRCA dataset, we select 200 features step by step. As can be seen from Fig. 4, on the SVM classifier, CEFS+ performs very well in the early stage (when s $\le$ 50), significantly better than Lasso and CEFS. On the RF and lgb classifiers, CEFS+ and CEFS significantly outperform Lasso. Overall, the proposed method performs better.

On the pancan_tcga dataset, we select 200 features step by step. Fig. 4 shows that CEFS+ performs markedly better than CEFS and Lasso on the RF and SVM classifiers. On the lgb classifier, CEFS+ is significantly better than Lasso, and its early-stage performance (when s $\le$ 39) is also better than that of CEFS.

Fig. 5 shows the variation of the average accuracy with the number of selected features. As can be seen, the proposed method has a better trend profile. On the arcene dataset, the highest average accuracy rate of 84.1% is achieved when the number of selected features is 24. On the BRCA dataset, both our method and JMIM perform well. On the pancan_tcga dataset, our method is better than other methods. The maximum average accuracy of each FS method on three sets of high-dimensional gene data is shown in Fig. 2f. From the figure, it can be seen that on the three datasets, our method significantly outperforms the other methods.

Interpretability of the CEFS+

One of the major advantages of FS over FE is its interpretability. FS retains the original features, whereas FE extracts information from the original features to construct new ones, which makes it far less interpretable. Taking data on genes and cancer as an example, only if we know which genes are critical in the development of cancer can we detect cancer early at the genetic level, thus facilitating subsequent diagnosis and treatment, which has been clinically proven to greatly improve the effectiveness of cancer treatment. If the key genes we obtain are “new genes” produced by FE, which do not exist in the genome, they are of no use to doctors, because they cannot be studied or treated at the genetic level. What we need, therefore, are the original features.

Here, we take the pancan_tcga dataset as an example to analyze the interpretability of the CEFS+ algorithm due to its large sample size and number of features.

Table 7 Top 30 genes selected by CEFS+.

Table 7 lists the 30 highest-priority features obtained by our CEFS+ algorithm. Fig. 6 plots the average copy number profiles of these 30 genes in six of the cancers. The figure shows that the copy number profiles of the majority of the genes are mutated in at least one of the cancers, with significant variability compared to the other categories. The relationship between some of these genes and diseases is as follows:

ZBTB7C, ranked 3rd in the CEFS+ sorting list, has an average copy value of -0.664 in COAD/READ (colon adenocarcinoma/rectum adenocarcinoma), which means that ZBTB7C has undergone copy number deletion in a large proportion of colorectal cancer patients. It has been shown^52,53 that ZBTB7C, as a complex transcription factor, is mainly enriched in metabolic pathways and pathways in cancer, and that this transcription factor affects fatty acid metabolism in tumor cells, thus altering the cell cycle of tumor cells. In addition, it has been noted⁵⁴ that ZBTB7C is down-regulated in colorectal cancer and is an early diagnostic marker and independent prognostic factor for colorectal cancer.
PCAT1, ranked 7th in the CEFS+ sorting list, has an average copy value of 0.819 in breast cancers, which suggests that a large proportion of breast cancer patients have amplified copy values of PCAT1. It has been shown that PCAT1 is a long non-coding RNA that regulates BRCA2 (Breast Cancer Susceptibility Gene 2) and controls homologous recombination in cancer⁵⁵.
VOPP1, ranked 9th in the CEFS+ sorting list, has an average copy value of 1.044 in GBM (glioblastoma multiforme), which implies that the vast majority of patients with glioblastoma have amplified copy values of the VOPP1 gene. It has been pointed out that VOPP1 is a gene that encodes a protein⁵⁶. The VOPP1 protein is important in a variety of biological processes and plays a role in tumor development and progression. It is overexpressed in a variety of cancer types and is thought to promote tumor cell survival and proliferation.
MIR645, ranked 15th in the CEFS+ sorting list, has an average copy value of 0.847 in COAD/READ cancer patients, which indicates that a large proportion of patients have copy number amplification of the MIR645 gene. Its variation is also large in ovarian serous cystadenocarcinoma and breast cancer. Several studies have shown that MIR645 is associated with many cancers. It is up-regulated and can act as an oncogenic miRNA in head and neck cancer, breast cancer and hepatocellular carcinoma^57,58,59. It has also been shown that MIR645 promotes invasiveness, metastasis and tumor growth in colorectal cancer by targeting EFNA5⁶⁰.
ASAP1-IT1, ranked 23rd in the CEFS+ sorting list, has a mean copy value of 1.119 in patients with ovarian cancer (OV), which indicates that the ASAP1-IT1 gene is amplified in the majority of patients. It has been found that ASAP1-IT1 acts as a potential tumor suppressor lncRNA by upregulating large tumor suppressor 2 (LATS2) expression in ovarian cancer⁶¹.

Conclusions

In this paper, we propose a novel feature selection method, CEFS+, which is based on copula entropy. The method computes the full-order mutual information of features via copula entropy, adding consideration of interaction to relevance and redundancy. This allows our method to capture more useful information so that subsets can be selected more efficiently. We have improved the basic algorithm (CEFS) for cases where it is at a disadvantage in the early rounds of selection. With the rank approach, the feature selection process becomes efficient and stable. We evaluated the classification accuracy of the method on five datasets using three classifiers: RF, SVM and LightGBM. Our method obtains the highest accuracy in 10 of 15 cases. While our method performs less well than Lasso on the first two small datasets, it outperforms the other seven methods (CEFS, MRMR, JMI, JMIM, Lasso, EN, ReliefF) on high-dimensional gene data where feature interactions exist. Our method obtains the highest average accuracy on all three high-dimensional gene datasets. In particular, the highest average accuracy of our method on the arcene and BRCA datasets is 84.1% and 89.8%, which is 3.2% and 3.7% higher than Lasso, respectively. On the pancan_tcga dataset, the highest average accuracy of the proposed method is 2.5% higher than Lasso.

In the current study, the CEFS+ method demonstrates excellent capabilities in feature selection and dimensionality reduction for high-dimensional data, with particularly strong performance on high-dimensional gene data. However, it adapts less well to low-dimensional data, which leaves room for optimization; efforts are therefore underway to generalize the method to other data types. Additionally, the method cannot yet automatically determine the optimal size of the feature subset; work is ongoing to construct an evaluation framework for this purpose. At the practical level, because gene expression data are complex and costly to acquire, sample sizes remain limited, which severely constrains the tuning and generalization performance of the algorithm. In the future, we will focus on expanding data collection channels to further refine the algorithm. In parallel, we will also begin developing feature selection tools based on CEFS+.

Data availability

The processed data and code for this paper are available at https://github.com/xr-Yan/CEFSplus.

References

Agarwal, S., Dhyani, A. & Ranjan, P. Newton’s second law based pso for feature selection: Newtonian pso. J. Intell. Fuzzy Syst. 37, 4923–4935 (2019).
Guyon, I. & Elisseeff, A. An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182 (2003).
Blum, A. L. & Langley, P. Selection of relevant features and examples in machine learning. Artif. Intell. 97, 245–271. https://doi.org/10.1016/S0004-3702(97)00063-5 (1997).
Hsu, H.-H., Hsieh, C.-W. & Lu, M.-D. Hybrid feature selection by combining filters and wrappers. Expert Syst. Appl. 38, 8144–8150. https://doi.org/10.1016/j.eswa.2010.12.156 (2011).
Dhal, P. & Azad, C. A comprehensive survey on feature selection in the various fields of machine learning. Appl. Intell. 1–39 (2022). https://doi.org/10.1007/s10489-021-02550-9.
Saeys, Y., Inza, I. & Larranaga, P. A review of feature selection techniques in bioinformatics. Bioinform. 23, 2507–2517 (2007). https://doi.org/10.1093/bioinformatics/btm344.
Chen, Z. et al. Feature selection with redundancy-complementariness dispersion. Knowl.-Based Syst. 89, 203–217 (2015). https://doi.org/10.1016/j.knosys.2015.07.004.
Hoque, N., Bhattacharyya, D. K. & Kalita, J. K. Mifs-nd: A mutual information-based feature selection method. Expert Syst. Appl. 41, 6371–6385. https://doi.org/10.1016/j.eswa.2014.04.019 (2014).
Bhatt, R. B. & Gopal, M. On fuzzy-rough sets approach to feature selection. Pattern Recognit. Lett. 26, 965–975. https://doi.org/10.1016/j.patrec.2004.09.044 (2005).
Vignolo, L. D., Milone, D. H. & Scharcanski, J. Feature selection for face recognition based on multi-objective evolutionary wrappers. Expert Syst. Appl. 40, 5077–5084. https://doi.org/10.1016/j.eswa.2013.03.032 (2013).
Kira, K. & Rendell, L. A. The feature selection problem: Traditional methods and a new algorithm. In Proceedings of the tenth national conference on Artificial intelligence, 129–134 (1992).
Kononenko, I. Estimating attributes: Analysis and extensions of relief. In European conference on machine learning, 171–182 (Springer, 1994).
John, G. H., Kohavi, R. & Pfleger, K. Irrelevant features and the subset selection problem. In Machine learning proceedings 1994, 121–129 (Elsevier, 1994).
Mitra, P., Murthy, C. & Pal, S. K. Unsupervised feature selection using feature similarity. IEEE Trans. Pattern Anal. Mach. Intell. 24, 301–312. https://doi.org/10.1109/34.990133 (2002).
Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Series B Stat. Methodol. 58, 267–288. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x (1996).
Zou, H. & Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Series B Stat. Methodol. 67, 301–320. https://doi.org/10.1111/j.1467-9868.2005.00503.x (2005).
Battiti, R. Using mutual information for selecting features in supervised neural net learning. IEEE Trans. Neural Netw. 5, 537–550. https://doi.org/10.1109/72.298224 (1994).
Kwak, N. & Choi, C.-H. Input feature selection for classification problems. IEEE Trans. Neural Netw. 13, 143–159. https://doi.org/10.1109/72.977291 (2002).
Peng, H., Long, F. & Ding, C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 27, 1226–1238. https://doi.org/10.1109/TPAMI.2005.159 (2005).
Estévez, P. A., Tesmer, M., Perez, C. A. & Zurada, J. M. Normalized mutual information feature selection. IEEE Trans. Neural Netw. 20, 189–201. https://doi.org/10.1109/TNN.2008.2005601 (2009).
Cheng, G., Qin, Z., Feng, C., Wang, Y. & Li, F. Conditional mutual information-based feature selection analyzing for synergy and redundancy. ETRI Journal 33, 210–218 (2011). https://doi.org/10.4218/etrij.11.0110.0237.
Foithong, S., Pinngern, O. & Attachoo, B. Feature subset selection wrapper based on mutual information and rough sets. Expert Syst. Appl. 39, 574–584. https://doi.org/10.1016/j.eswa.2011.07.048 (2012).
Sun, X. et al. Feature selection using dynamic weights for classification. Knowl.-Based Syst. 37, 541–549. https://doi.org/10.1016/j.knosys.2012.10.001 (2013).
Zeng, Z., Zhang, H., Zhang, R. & Yin, C. A novel feature selection method considering feature interaction. Pattern Recognit. 48, 2656–2666. https://doi.org/10.1016/j.patcog.2015.02.025 (2015).
Witten, I. H. & Frank, E. Data mining: practical machine learning tools and techniques. Acm Sigmod Rec. 31, 76–77 (2011).
Gu, X., Guo, J., Li, C. & Xiao, L. A feature selection algorithm based on redundancy analysis and interaction weight. Appl. Intell. 51, 2672–2686. https://doi.org/10.1007/s10489-020-01936-5 (2021).
Ma, J. Variable selection with copula entropy. arXiv preprint arXiv:1910.12389 (2019). https://doi.org/10.48550/arXiv.1910.12389.
Zhang, P., Liu, G. & Gao, W. Distinguishing two types of labels for multi-label feature selection. Pattern Recognit. 95, 72–82 (2019).
Zhang, P., Gao, W., Hu, J. & Li, Y. A conditional-weight joint relevance metric for feature relevancy term. Eng. Appl. Artif. Intell. 106, 104481 (2021).
Zhang, P. & Gao, W. Feature relevance term variation for multi-label feature selection. Appl. Intell. 51, 5095–5110 (2021).
Hu, L., Gao, L., Li, Y., Zhang, P. & Gao, W. Feature-specific mutual information variation for multi-label feature selection. Inf. Sci. 593, 449–471 (2022).
Li, Y., Hu, L. & Gao, W. Multi-label feature selection via robust flexible sparse regularization. Pattern Recognit. 134, 109074 (2023).
Han, Q., Hu, L. & Gao, W. Feature relevance and redundancy coefficients for multi-view multi-label feature selection. Inf. Sci. 652, 119747 (2024).
Shannon, C. E. A mathematical theory of communication. ACM SIGMOBILE Mob. Comput. Commun. Rev. 5, 3–55 (2001).
Kraskov, A., Stögbauer, H. & Grassberger, P. Estimating mutual information. Phys. Rev. E. 69, 066138. https://doi.org/10.1103/PhysRevE.69.066138 (2004).
Kendall, M. G. & Stuart, A. Standard errors of functions of random variables. The advanced theory of statistics 1, 231–232 (1969).
Yu, L. & Liu, H. Efficient feature selection via analysis of relevance and redundancy. J. Mach. Learn. Res. 5, 1205–1224 (2004).
McGill, W. Multivariate information transmission. Transactions IRE Prof. Group on Inf. Theory 4, 93–111. https://doi.org/10.1109/TIT.1954.1057469 (1954).
Ma, J. & Sun, Z. Mutual information is copula entropy. Tsinghua Sci. Technol. 16, 51–54. https://doi.org/10.1016/S1007-0214(11)70008-6 (2011).
Wu, Q. & Li, D. Cria: An interactive gene selection algorithm for cancers prediction based on copy number variations. Front. Plant Sci. 13, 839044. https://doi.org/10.3389/fpls.2022.839044 (2022).
Deb, K., Agrawal, S., Pratap, A. & Meyarivan, T. A fast elitist non-dominated sorting genetic algorithm for multi-objective optimization: Nsga-ii. In Parallel Problem Solving from Nature PPSN VI: 6th International Conference Paris, France, September 18–20, 2000 Proceedings 6, 849–858 (Springer, 2000). https://doi.org/10.1007/3-540-45356-3_83.
Jolicoeur-Martineau, A., Fatras, K. & Kachman, T. Generating and imputing tabular data via diffusion and flow-based gradient-boosted trees. arXiv preprint arXiv:2309.09968 (2023).
Rappoport, N. & Shamir, R. Multi-omic and multi-view clustering algorithms: review and cancer benchmark. Nucleic Acids Res. 46, 10546–10562. https://doi.org/10.1093/nar/gky889 (2018).
Ciriello, G. et al. Emerging landscape of oncogenic signatures across human cancers. Nat. Genet. 45, 1127–1133. https://doi.org/10.1038/ng.2762 (2013).
Cerami, E. et al. The cbio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. Cancer Discov. 2, 401–404. https://doi.org/10.1158/2159-8290.CD-12-0095 (2012).
Gao, J. et al. Integrative analysis of complex cancer genomics and clinical profiles using the cbioportal. Sci. Signal. 6, pl1–pl1 (2013). https://doi.org/10.1126/scisignal.2004088.
Yang, H. & Moody, J. Data visualization and feature selection: New algorithms for nongaussian data. Adv. Neural Inf. Process. Syst. 12 (1999).
Bennasar, M., Hicks, Y. & Setchi, R. Feature selection using joint mutual information maximisation. Expert Syst. Appl. 42, 8520–8532. https://doi.org/10.1016/j.eswa.2015.07.007 (2015).
Breiman, L. Random forests. Machine learning 45, 5–32. https://doi.org/10.1023/A:1010933404324 (2001).
Platt, J. et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers 10, 61–74 (1999).
Ke, G. et al. Lightgbm: A highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst. 30 (2017).
Hur, M.-W., Yoon, J.-H., Kim, M.-Y., Ko, H. & Jeon, B.-N. Kr-pok (zbtb7c) regulates cancer cell proliferation through glutamine metabolism. Biochimica et Biophysica Acta (BBA)-Gene Regulatory Mechanisms 1860, 829–838 (2017). https://doi.org/10.1016/j.bbagrm.2017.05.005.
Jeon, B.-N. et al. Kr-pok increases fasn expression by modulating the dna binding of srebp-1c and sp1 at the proximal promoter. J. Lipid Res. 53, 755–766 (2012).
Chen, X. et al. Zinc finger and btb domain-containing 7c (zbtb7c) expression as an independent prognostic factor for colorectal cancer and its relevant molecular mechanisms. Am. J. Transl. Res. 12, 4141 (2020).
Prensner, J. R. et al. Pcat-1, a long noncoding rna, regulates brca2 and controls homologous recombination in cancer. Cancer Res. 74, 1651–1660. https://doi.org/10.1158/0008-5472.CAN-13-3159 (2014).
Baras, A. S., Solomon, A., Davidson, R. & Moskaluk, C. A. Loss of vopp1 overexpression in squamous carcinoma cells induces apoptosis through oxidative cellular injury. Lab. Investig. 91, 1170–1180. https://doi.org/10.1038/labinvest.2011.70 (2011).
Sun, Q. et al. Dysregulated mir-645 affects the proliferation and invasion of head and neck cancer cell. Cancer Cell Int. 15, 1–8. https://doi.org/10.1186/s12935-015-0238-5 (2015).
Cai, Y., Li, W.-F., Sun, Y. & Liu, K. Downregulation of microrna-645 suppresses breast cancer cell metastasis via targeting dcdc2. Eur. Rev. Med. Pharmacol. Sci. 21 (2017).
Tao, J. et al. Microrna-645 represses hepatocellular carcinoma progression by inhibiting sox30-mediated p53 transcriptional activation. Int. J. Biol. Macromol. 121, 214–222. https://doi.org/10.1016/j.ijbiomac.2018.10.032 (2019).
Li, S. et al. Mir-645 promotes invasiveness, metastasis and tumor growth in colorectal cancer by targeting efna5. Biomed. Pharmacother. 125, 109889. https://doi.org/10.1016/j.biopha.2020.109889 (2020).
Wang, K., Hu, Y.-B., Zhao, Y. & Ye, C. Long non-coding rna asap1-it1 suppresses ovarian cancer progression by regulating hippo/yap signaling. Int. J. Mol. Med. 47, 1–1. https://doi.org/10.3892/ijmm.2021.4877 (2021).

Acknowledgements

This work was supported by National Natural Science Foundation of China (Grant No. 32470678), Basic Research Programs of Shanxi Province (202303021211069) and Key Research and Development Programs of Shanxi Province (202402020101008).

Author information

Xiaoran Yan and Shilong Shang: These authors contributed equally to this work.

Authors and Affiliations

College of Computer Science and Technology, Taiyuan University of Technology, Taiyuan, Shanxi, China
Shilong Shang & Dongxi Li
College of Artificial Intelligence, Taiyuan University of Technology, Taiyuan, Shanxi, China
Xiaoran Yan & Yun Dang

Contributions

Xiaoran Yan: Data curation, Methodology, Writing - original draft. Shilong Shang: Data curation, Methodology, Writing - original draft. Dongxi Li: Conceptualization, Supervision, Writing - review & editing. Yun Dang: Formal analysis, Writing - original draft.

Corresponding author

Correspondence to Dongxi Li.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Yan, X., Shang, S., Li, D. et al. An efficient and interactive feature selection approach based on copula entropy for high-dimensional genetic data. Sci Rep 15, 30100 (2025). https://doi.org/10.1038/s41598-025-15068-8

Received: 12 March 2025
Accepted: 05 August 2025
Published: 17 August 2025
Version of record: 17 August 2025
DOI: https://doi.org/10.1038/s41598-025-15068-8

This article is cited by

A copula based supervised filter for feature selection in machine learning driven diabetes risk prediction
- Agnideep Aich
- Md. Monzur Murshed
- Amanda Mayeaux
Scientific Reports (2026)
