Introduction

Micro-expressions are involuntary facial expressions of extremely short duration (0.04 to 0.2 seconds), acting as subtle yet powerful indicators of real emotions and intentions that individuals often attempt to conceal1. Unlike macro-expressions, which are usually consciously controlled and may not accurately reflect a person’s true emotions, micro-expressions provide a more reliable and distinct indication of emotion2. Accurately detecting these fleeting expressions has significant implications for various fields, including law enforcement, security, psychological research, and professional negotiations. Micro-expressions can help physicians observe subtle changes in a patient’s mood, which is particularly crucial for the early diagnosis of mental health conditions3. For instance, patients with mood disorders such as anxiety and depression may unconsciously display several typical micro-expressions, which can serve as diagnostic aids4. Certain neurological disorders, such as Parkinson’s and Alzheimer’s diseases, can affect patients’ facial muscle movement, thereby altering their facial expressions. By analyzing changes in micro-expressions, doctors can detect the condition at an earlier stage or monitor its progression. Micro-expressions also have value in pain assessment5, particularly in patients unable to speak or express themselves, such as infants, Alzheimer’s patients, or those in a coma. Changes in facial micro-expressions can provide valuable information regarding pain levels, aiding physicians in administering more precise treatments. During hemodialysis6, the use of micro-expression recognition technology can rapidly detect abnormalities in the patient’s condition, enable early intervention in unforeseen circumstances, and enhance operational efficiency. 
In doctor-patient interactions, micro-expression analysis can help doctors understand the patient’s true feelings or confusion, thereby improving the treatment process, enhancing patient trust, and facilitating communication. During surgery, the micro-expressions of surgical team members may reflect emotions such as nervousness, anxiety, or confidence. Timely detection of these changes can help the team regulate emotions and ensure smooth operations.

Early micro-expression research primarily focused on traditional computer vision techniques for feature extraction and classification. The groundwork for micro-expression analysis was laid by the pioneering work of Ekman and Friesen, who developed the Facial Action Coding System (FACS), which established a standardized system for classifying facial actions7. Researchers relied on manual feature extraction methods, such as Local Binary Pattern (LBP), to capture texture information in facial images. LBP is effective at highlighting local texture variations, thereby enabling effective discrimination of facial expressions8. However, it cannot incorporate temporal information, which is essential for accurate micro-expression analysis. To address this limitation, Zhao et al.9 introduced LBP-TOP, an extension that operates on three orthogonal planes and incorporates temporal information; it increases the computational load but offers a more comprehensive feature set. Wang et al.10 further improved the method by proposing LBP-SIP, which effectively reduces redundancy and enhances computational efficiency. Optical flow11 is a technique for analyzing object motion between image frames, and its application in micro-expression recognition has been extensively explored. Liu et al.12 used the main directional mean optical flow (MDMO) to capture regional facial motions, and Liong et al.13 designed the bi-weighted oriented optical flow (Bi-WOOF) to weigh local and global motion cues; both have proven effective.

The emergence of deep learning14,15,16,17 marked a paradigm shift in micro-expression recognition research. Convolutional Neural Networks (CNNs) were among the first deep learning models applied to this task. Patel et al.18 addressed the challenge posed by the limited sample size of micro-expression datasets by employing a pre-trained VGGNet for feature extraction through transfer learning. Subsequent work has focused on modifying the network architecture to accommodate the specific challenges of micro-expression data. For instance, Peng et al.19 mitigated overfitting by reducing the number of ResNet layers. To better capture spatial and temporal information, researchers have introduced hybrid models that integrate CNNs with recurrent neural networks (RNNs) or long short-term memory (LSTM) networks. These models employ CNNs to extract spatial features and RNNs to model temporal dependencies, thereby significantly enhancing recognition accuracy. Three-dimensional CNNs (3D-CNNs) have also been used to jointly process spatial and temporal data, with Reddy et al. focusing on regional 3D-CNNs to enhance computational efficiency. Cakir et al.20 utilized action units (AUs) to localize the most active facial landmarks and determine the most representative regional scale for each landmark in a detection task. Their study on variable-scale landmark patches for AU detection, employing a vision transformer (ViT) with a perceptual attention mechanism, achieved significant results.

Recent advances in micro-expression recognition have brought about a further paradigm shift with the introduction of vision-transformer-based models capable of capturing long-range dependencies and processing data in parallel. The Vision Transformer (ViT)21,22 of Dosovitskiy et al. has had a profound impact by applying the transformer architecture to image classification tasks; it replaced traditional convolutional operations with a self-attention mechanism and exhibited exceptional scalability and performance on large-scale datasets. Researchers have since applied vision transformers to micro-expression recognition: Liu et al.23 proposed a lightweight ViT model that enhances micro-expression analysis via transfer learning, and Wang et al.24 introduced HTNet, a hierarchical transformer network that combines optical flow features of facial regions and addresses the limitations of previous models by considering facial structure and local-to-global feature relationships.

Vision transformer models still face challenges such as high computational requirements and the need for large datasets, whereas micro-expression datasets are usually small and often cannot meet these requirements25. This motivates ongoing research to improve the efficiency and generalization ability of these models and to explore techniques such as data augmentation and adversarial training to make better use of limited data. In summary, micro-expression recognition has evolved from manual feature extraction to deep learning-based approaches, and the latest vision transformer models show great potential26; however, computational efficiency, dataset limitations, and real-time analysis requirements remain the core challenges of current research27. To address these challenges, we investigate vision-transformer-based micro-expression recognition and propose a new approach that improves recognition accuracy and efficiency.

Our approach builds on HTNet24, a hierarchical transformer network that integrates the optical flow features of specific facial regions and captures the inherent spatial relationships among facial landmarks through a multilayer transformer module. To enhance the model’s ability to capture subtle features, we propose a Learnable Absolute Position Embedding (LAPE) module, which significantly improves the recognition of fine details and thereby optimizes recognition accuracy. In addition, to mitigate the computational overhead associated with LAPE and to simplify the model, we propose an entropy-based selective removal technique for attention layers and introduce a novel agent attention mechanism28. These innovations reduce the model’s parameters and computational requirements while preserving its ability to learn rich features, achieving an effective balance between computational efficiency and representational capability. Finally, to address the limited sample size of micro-expression datasets, which constrains the model’s generalization ability, we integrate a data augmentation technique based on the diffusion model29. This approach enhances the detection accuracy and robustness of the micro-expression recognition model, making it more suitable for practical application scenarios.

In this study, we provide a comprehensive overview of our contributions to the field of micro-expression recognition, focusing specifically on innovations and improvements to existing vision-transformer-based models. We perform a systematic evaluation of the proposed approach, demonstrate its effectiveness across multiple datasets, and discuss the potential impact of micro-expression recognition techniques in practical applications.

Results

Experimental methodology

Our experiments aim to assess the effectiveness of the key components of our framework: the LAPE module, the ESAAT module, and the diffusion model-based data augmentation technique. We also compare our model’s performance with state-of-the-art methods and evaluate its generalization ability across a diverse dataset.

Experimental implementation details

In this paper, we employ cross-entropy as the loss function, with Adam as the optimizer, a learning rate of \(5 \times 10^{-5}\), and 800 training epochs. The experiments were conducted on a system running Ubuntu 20.04 LTS (Focal Fossa), equipped with an Intel Xeon(R) Gold 6430 processor, an NVIDIA GeForce RTX 4090 GPU (24GB), and 120GB of RAM. The software environment includes Python 3.8 and CUDA 11.3.
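The training configuration above can be sketched in PyTorch; the model here is a stand-in linear classifier (HTNet itself is not reproduced), so the helper name and shapes are illustrative:

```python
import torch
import torch.nn as nn

def build_training_setup(model: nn.Module):
    """Loss and optimizer with the settings reported above:
    cross-entropy loss and Adam with a learning rate of 5e-5."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
    return criterion, optimizer

# Minimal usage with a placeholder 3-class classifier.
model = nn.Linear(16, 3)
criterion, optimizer = build_training_setup(model)
logits = model(torch.randn(4, 16))
loss = criterion(logits, torch.tensor([0, 1, 2, 0]))
loss.backward()
optimizer.step()
```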

Datasets

We employ four widely-used micro-expression datasets: SMIC30, SAMM31, CASME II32, and CAS(ME)333. These datasets provide a comprehensive range of spontaneous micro-expressions from various subjects, covering a range of emotional responses. The SMIC dataset contains 164 micro-expression sequences with three categories: positive, negative, and surprise. The SAMM dataset consists of 133 sequences with similar emotional categories. The CASME II dataset includes 145 sequences with a focus on spontaneous micro-expressions. The CAS(ME)3 dataset is the largest, containing 673 sequences and providing a more diverse and ecologically valid set of expressions.

Experimental metric

Owing to the imbalanced distribution of micro-expressions across the three categories in the micro-expression dataset, we employ the unweighted F1 score (UF1) and the unweighted average recall (UAR) as evaluation metrics for the model to objectively assess its performance.

UF1 evaluates the overall performance of the model across all categories by averaging the F1-scores of individual categories. Similar to the conventional macro-averaged F1-score (Macro F1-score), UF1 calculates the F1-score for each category and performs an unweighted average to prevent underrepresented categories from being overlooked. Specifically, for each category c, the F1-score is computed as follows:

$$\begin{aligned} F1_c = \frac{2 \times P_c \times R_c}{P_c + R_c} \end{aligned}$$
(1)

where \(P_c\) (precision) is defined as:

$$\begin{aligned} P_c = \frac{TP_c}{TP_c + FP_c} \end{aligned}$$
(2)

and \(R_c\) (recall) is defined as:

$$\begin{aligned} R_c = \frac{TP_c}{TP_c + FN_c} \end{aligned}$$
(3)

where \(TP_c\) represents the number of true positive instances (True Positives) for category c, \(FP_c\) denotes the number of false positive instances (False Positives) for category c, and \(FN_c\) refers to the number of false negative instances (False Negatives) for category c. The F1-scores of all categories are then averaged as follows:

$$\begin{aligned} UF1 = \frac{1}{C}\sum _{c=1}^{C}F1_c \end{aligned}$$
(4)

The UF1 is well-suited for handling category imbalance, as it prevents the metric from being disproportionately influenced by categories with larger data volumes. This metric evaluates the overall performance of the model across all categories while ensuring that the classification performance of underrepresented categories is not overlooked. UF1 ranges from 0 to 1, with values closer to 1 indicating better overall model performance.

Unweighted Average Recall (UAR) calculates the recall for each category and then averages these values. This metric assesses the model’s ability to recognize all categories while preventing the overall score from being skewed by categories with larger data volumes. UAR is defined as follows:

$$\begin{aligned} UAR = \frac{1}{C}\sum _{c=1}^{C}R_c \end{aligned}$$
(5)

where \(R_c\) is the recall for category c, and C denotes the total number of categories. UAR quantifies the model’s ability to recognize all categories and is particularly suitable for datasets with imbalanced category distributions. Because the metric depends solely on recall, it reflects the model’s effectiveness in recognizing samples from underrepresented categories. UAR ranges from 0 to 1, with higher values indicating greater average recall across all categories.
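Both metrics follow directly from per-category counts of true positives, false positives, and false negatives, as in Eqs. (1)–(5); a minimal sketch (the function name and interface are illustrative):

```python
from collections import defaultdict

def uf1_uar(y_true, y_pred, num_classes):
    """Unweighted F1 (Eqs. 1-4) and unweighted average recall (Eq. 5)."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but true class was t
            fn[t] += 1
    f1s, recalls = [], []
    for c in range(num_classes):
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
        recalls.append(rec)
    # Unweighted averages: every category counts equally.
    return sum(f1s) / num_classes, sum(recalls) / num_classes
```

Because each category contributes equally to the average, a majority class cannot mask poor recognition of a rare one.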

Comparative experiments

We compare the performance of our model with several state-of-the-art micro-expression recognition models, such as LBP-TOP10, Bi-WOOF13, OFF-ApexNet34, STSTNet35, MobileViT36, MMNet37, Micron-BERT38 and HSTA39. We conduct the experiments with K-fold cross-validation, with the final results presented in Table 1.

Table 1 Quantitative experiments compare the proposed method with representative approaches across three datasets.

Generalization experiments

To evaluate the generalization capacity of our model, we performed experiments on the CAS(ME)3 dataset, which is known for its diversity and ecological validity. We employed two evaluation strategies to assess model performance, and the experimental results are presented in Tables 2 and 3.

Cross-dataset validation: We performed K-fold cross-validation on the CAS(ME)3 dataset to evaluate the model’s capacity to generalize to unseen data.

Impact of Data Augmentation: We compared the model’s performance with and without diffusion model-based data augmentation to quantify its effectiveness in improving generalization.

Table 2 Quantitative experiments evaluate the proposed method against representative approaches in a cross-dataset setting.
Table 3 Ablation study on data augmentation modules.

Ablation studies

An ablation study was performed to evaluate the respective contributions of the LAPE and ESAAT modules to the overall performance of our model. The experimental results are presented in Tables 4 and 5.

LAPE module ablation: We compared the performance of the model with and without the LAPE module to assess its role in capturing spatial relationships in micro-expressions.

ESAAT module ablation: We analyzed the influence of the ESAAT module in reducing computational complexity while maintaining accuracy. Additionally, we examined the reduction in model parameters and its effect on recognition accuracy.

Table 4 Ablation study on LAPE modules.
Table 5 Ablation study on ESAAT modules.
Fig. 1: Comparative analysis of different module combinations in terms of resolution, parameters, and FLOPs.

From the results presented in the tables above and Fig. 1, it is evident that, compared to the micro-expression recognition model without the LAPE and ESAAT modules, their inclusion reduces the number of model parameters by approximately 18% while yielding slight performance improvements across the different datasets. These results indicate that the LAPE and ESAAT modules play a crucial role in balancing computational efficiency and expressive power, reducing computational overhead while enhancing the model’s representational capacity.

Results and discussion

The results of our experiments reveal the following key findings:

Comparative Experiments: Our model outperforms or matches the state-of-the-art methods in both accuracy and efficiency. The integration of the ESAAT module and data augmentation technique provides a competitive advantage, particularly in handling diverse and complex expressions.

Generalization Experiments: The generalization experiments on the CAS(ME)3 dataset demonstrate that our model generalizes well to new data, with the data augmentation technique significantly enhancing performance.

Ablation Studies: The LAPE module significantly enhances the model’s ability to capture spatial relationships, resulting in higher recognition accuracy. The ESAAT module efficiently reduces the model’s computational complexity while maintaining accuracy.

The experimental results demonstrate the effectiveness of our proposed framework for micro-expression recognition. The LAPE and ESAAT modules, when integrated with diffusion model-based data augmentation, not only boost the model’s accuracy and efficiency but also substantially enhance its generalization capabilities. These findings underscore the potential of our framework for real-world applications that require accurate and robust micro-expression recognition.

Discussion

The proposed micro-expression recognition framework, which combines HTNet with the LAPE and ESAAT modules as well as diffusion model-based data augmentation, significantly improves the accuracy and efficiency of micro-expression recognition. The framework’s performance on multiple datasets demonstrates its potential for practical applications. Future work will focus on enhancing the model’s real-time inference capabilities and on leveraging Vision Transformers (ViTs) with adaptive patch sizes for multimodal fusion.

Methods

Method details

The methodology proposed for micro-expression recognition is a comprehensive framework that integrates advanced deep learning techniques, innovative attention mechanisms, and data augmentation strategies. Figure 2 presents the overall architectural diagram of the proposed method. This section provides a detailed explanation of the three core components of our approach: the Learnable Absolute Position Embedding (LAPE) module, the Entropy-based Selection Agent Attention (ESAAT) module, and the diffusion model-based data augmentation technique.

Fig. 2: The overall architecture of the proposed model.

Learnable absolute position embedding module

The LAPE module is designed to enhance the model’s ability to capture the spatial dependencies within facial expressions. Traditional Vision Transformer models rely on fixed position embeddings, which may not be fully effective in capturing the nuances of micro-expressions. Our LAPE module introduces learnable position embeddings that adapt to the specific spatial features of facial movements.

The LAPE module functions as follows: 1) For each image patch, a unique position embedding is learned during the training process. 2) These embeddings are added to the patch embeddings, supplying the model with information regarding the relative positions of different facial regions. 3) The position embeddings are optimized alongside the rest of the model, enabling the network to better capture the spatial hierarchy of facial expressions.

Mathematically, the LAPE can be formulated as:

$$\begin{aligned} LAPE(x_i)=x_i+PE(p_i) \end{aligned}$$
(6)

where \(x_i\) is the embedding of the i-th patch, \(p_i\) is its position, and \(PE(p_i)\) is the learnable position embedding vector for that position.
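Equation (6) can be sketched as a small PyTorch module; the tensor shapes and the truncated-normal initialization are illustrative assumptions rather than the paper’s exact implementation:

```python
import torch
import torch.nn as nn

class LAPE(nn.Module):
    """Learnable absolute position embedding (Eq. 6): LAPE(x_i) = x_i + PE(p_i).

    One embedding vector is learned per patch position and optimized
    jointly with the rest of the network.
    """
    def __init__(self, num_patches: int, dim: int):
        super().__init__()
        # Shape (1, num_patches, dim) so it broadcasts over the batch.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        nn.init.trunc_normal_(self.pos_embed, std=0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, dim) patch embeddings.
        return x + self.pos_embed
```

Because `pos_embed` is an `nn.Parameter`, it receives gradients like any other weight, so the position code adapts to the spatial statistics of facial movements during training.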

Entropy-based selection agent attention module

The ESAAT module addresses the computational inefficiency of traditional attention mechanisms by selectively removing less relevant attention layers based on entropy measures. This approach reduces the model’s computational complexity without sacrificing performance.

Fig. 3: Schematic diagram of agent attention.

The ESAAT module operates through the following steps: 1) Compute the transfer entropy between each attention layer and the output layer to determine the importance of each layer. 2) Remove the attention layers with low transfer entropy, as they contribute less to the final output. 3) Integrate a new attention mechanism, Agent Attention, which combines the advantages of softmax and linear attention to balance computational efficiency and representational power.
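Steps 1–2 reduce to ranking layers by an importance score and pruning the lowest-ranked ones. A minimal sketch, with the per-layer scores assumed to be precomputed (the transfer-entropy estimation itself depends on the chosen estimator and is not reproduced here); the function name and `keep_ratio` default are illustrative:

```python
def layers_to_keep(scores, keep_ratio=0.75):
    """Given one importance score per attention layer (higher = more
    informative, e.g. transfer entropy to the output layer), return the
    sorted indices of layers to retain. The lowest-scoring layers are
    pruned, mirroring step 2."""
    order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    k = max(1, round(len(scores) * keep_ratio))  # always keep at least one
    return sorted(order[:k])
```

For example, with scores `[0.9, 0.1, 0.5, 0.7]` and `keep_ratio=0.5`, layers 0 and 3 are retained and the two least informative layers are removed.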

The Agent Attention (as shown in Fig. 3) mechanism can be mathematically represented as:

$$\begin{aligned} Att(Q,K,A,V)=Softmax(QA^T)Softmax(AK^T)V \end{aligned}$$
(7)

where Q (query) represents the query matrix, which encodes the query vectors of the input data; K (key) denotes the key matrix; V (value) refers to the value matrix; and A (agent matrix) serves as an intermediary that regulates the interaction between the queries and the keys. The function \(softmax(\cdot )\) denotes standard softmax normalization. Compared to the conventional self-attention mechanism, Agent Attention introduces the agent matrix A, which decomposes the attention computation into two stages: in the first stage, the correlation between Q and A is computed; in the second stage, the correlation between A and K is computed. This two-stage attentional weighting allows the query information to be modulated by the agent matrix before interacting with the key-value pairs. It also enhances flexibility: while traditional attention mechanisms compute the relationship between query and key directly, Agent Attention introduces an intermediary mapping through the agent matrix A, enabling the model to capture more intricate attention patterns and operate in higher-order feature spaces.

Diffusion model-based data augmentation

Micro-expression datasets suffer from category imbalance and limited data distribution. To overcome these limitations, we employ a diffusion-model-based data augmentation technique, which introduces diversity into the training data by gradually adding noise to the images and then training a model to reverse this process, generating new samples that resemble the original data while incorporating diverse expressions.

Fig. 4: Data augmentation based on the diffusion model.

Specifically, an original micro-expression image \(x_0\) is first initialized, followed by the selection of a diffusion time step T and the definition of a noise schedule \(\beta _t\), which typically follows a cosine schedule to ensure varying noise intensity at each time step. A sequence of Gaussian noise samples \(\epsilon _t \sim {\mathcal {N}}(0,I)\) is then generated to progressively perturb the image. At each time step t, the perturbation process is executed according to the predefined noise schedule, which is mathematically formulated as follows:

$$\begin{aligned} x_{t}=\sqrt{\alpha _{t}} x_{0}+\sqrt{1-\alpha _{t}} \epsilon _{t} \end{aligned}$$
(8)

where \(\alpha _{t}=\prod _{i=1}^{t}\left( 1-\beta _{i}\right)\) represents the cumulative noise attenuation coefficient. This formula indicates that, at each step, the contribution of the original image \(x_0\) gradually diminishes while the influence of the noise \(\epsilon _{t}\) progressively increases, ultimately yielding pure Gaussian noise at \(t=T\). Finally, a denoising network (typically a U-Net) is trained to predict either the noise \(\epsilon _{t}\) or the clean image \(x_0\) directly. The optimization is performed using a mean square error (MSE) loss function:

$$\begin{aligned} {\mathcal {L}}={\mathbb {E}}_{x_0, \epsilon , t}[|| \epsilon - f_\theta (x_t, t) || ^ 2 ] \end{aligned}$$
(9)

The clear image is gradually restored by reverse denoising during inference, using the following update rule:

$$\begin{aligned} x_{t-1}=\frac{1}{\sqrt{1-\beta _t} } \left( x_t-\frac{\beta _t}{\sqrt{1-\alpha _t}}f_\theta (x_t,t)\right) +\sigma _tz \end{aligned}$$
(10)

where \(\sigma _t\) represents the coefficient associated with noise intensity, and \(z\sim {\mathcal {N}}(0,I)\) denotes the random noise used for sampling.
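The forward perturbation (Eq. 8) and noise-prediction loss (Eq. 9) can be sketched as follows, with the denoising network \(f_\theta\) left abstract and \(\alpha_t\) the cumulative product defined above (function names and shapes are illustrative):

```python
import torch

def forward_diffuse(x0, t, betas):
    """Forward perturbation (Eq. 8): x_t = sqrt(a_t) x0 + sqrt(1 - a_t) eps,
    with a_t = prod_{i<=t} (1 - beta_i) the cumulative attenuation
    coefficient. t indexes betas from 0, so t = len(betas) - 1 is step T."""
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t]
    eps = torch.randn_like(x0)
    xt = alpha_bar.sqrt() * x0 + (1.0 - alpha_bar).sqrt() * eps
    return xt, eps

def diffusion_loss(model, x0, t, betas):
    """MSE between the true and predicted noise (Eq. 9)."""
    xt, eps = forward_diffuse(x0, t, betas)
    return ((eps - model(xt, t)) ** 2).mean()
```

As t grows, `alpha_bar` shrinks toward zero, so the sample is dominated by noise at the final step, matching the behaviour described above.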

Integration of components

The final model integrates the LAPE, ESAAT, and data augmentation techniques to establish a robust micro-expression recognition framework. The LAPE module provides the model with enhanced spatial awareness, the ESAAT module optimizes the attention mechanism for efficiency, and the data augmentation technique expands the dataset, thereby enhancing the model’s generalization ability.