Introduction

Existing generative adversarial networks (GANs) have shown promising advances across image-generation and image-translation tasks1. Considerable attention has been drawn to creating photorealistic human images with altered facial expressions2. In computer vision, generator-based deep neural networks such as GANs take an input image and translate it into a required target image, but they struggle to preserve the subject's identity and facial features. Owing to the high variability of human expressions and the complexity of face images, facial expression analysis and synthesis remain open problems in GAN-based computer vision applications.

The facial synthesis applications of computer vision offer wide-ranging benefits on social media platforms such as Twitter. Twitter is a large-audience platform used seamlessly by the public for instant debate, real-time conversation, and self-expression of opinions. The profile image of a user, called the avatar (AVI), is an object of user identity and is provided when the user creates an account. This AVI remains static until the user explicitly changes it. From business units to organizations, from prominent visionaries to ordinary users, Twitter is the forum where opinions are shared as tweets, and feelings ranging from sorrow and agony to happiness are communicated through them.

When it comes to tweets from prominent leaders and large organizations, tweet sentiment is a guiding parameter for maintaining good communal relationships and followings. Triggering events such as climatic disasters, accidents, and ethnic violence in a country result in a sharp rise in tweet volume. Yet whoever the user, and whatever the sentiment of the tweet, we observed that the user's AVI remains unchanged. When a nation's most loved leader tweeted about a deadly air crash, the tweet carried deep sorrow, but the AVI was still smiling. From a psychological viewpoint, instances like these may affect people's ethical conscience and skew opinions about the person who tweeted. To keep social relationships on par with the sentiment of the words a user tweets, a method is needed that can update the expression of the AVI to match the sentiment of the tweet, and we take this up as the problem case of our study.

Facial expression synthesis is a deep learning approach that aims to adjust and alter the facial expression in an image3. The approach uses facial landmarks to reposition facial movements from an example image so as to arrive at a target image. Expression synthesis has great value in many computer vision applications such as game animation, face recognition, and visual face language. In facial expression translation, the synthesised expressions are implanted into the target image. Many state-of-the-art computer vision techniques have used GANs for facial expression analysis and synthesis. Despite these advances, such methods still face several challenges: the high variability of human faces in expression, shape, and size; the need for large training sets with uniform facial characteristics; and the synthesis of photorealistic expressions that preserve user identity.

Our work proposes a deep learning GAN model, GAN-AVI, which is trained to translate the AVI facial expression dynamically based on the tweet sentiment before the tweet reaches its audience. The proposed GAN-AVI model combines custom facial landmark extraction, facial expression synthesis, and translation.

Disasters such as air crashes, rail accidents, landslides, and unexpected floods that claim many human lives are among the causes of sharp rises in tweet counts. A study was therefore undertaken to analyse the tweet sentiment and the AVI of the tweeter. While most tweets expressed sorrow, sadness, and helplessness, 70% of the tweeters' AVIs carried a smiling expression. The tweeter may simply be unmindful of the AVI's expression. When the emotional display in a picture does not align with the emotion conveyed in the related text, it may cause unease, erode trust, and result in unintended social or cultural misunderstandings. Instances like this may affect the ethical conscience of both the tweeter and the public. From a psychological standpoint, concepts such as Congruence Theory4 and Expectancy Violation Theory5 provide a basis for understanding these impacts. Tackling sentiment-expression incongruence is thus both a psychological requirement for authentic human interaction and a technical necessity for creating fair, reliable communication systems.

These multimodal sentiment mismatches motivated the proposed approach, which addresses sentiment-expression incongruence and promotes communication that is accurate, fair, and trustworthy. In today's AI-driven world, why not have an approach that dynamically synthesizes facial expressions according to the tweet sentiment and translates the tweeter's AVI accordingly? Synthesising photorealistic AVIs analogous to the tweet sentiment is the motivation of our research, and we believe this work can add value to many social media applications. Synthesising facial expressions that match the tweet sentiment is a novel approach that should attract the attention of many social media platforms. Our approach, GAN-AVI, uses deep learning based generative adversarial networks for sentiment-based facial expression synthesis. The main contributions of the proposed work are a tweet sentiment label extraction framework and a framework to dynamically transform the AVI facial expression analogously to the tweet sentiment.

Several research questions guided us in identifying the gaps aligned with our proposed objectives. Our investigation gains direction and novelty from the following questions: In what ways do inconsistencies between the sentiment conveyed in text and the expressions shown in profile images influence user perceptions of credibility and trust, and what measurable metrics emerge in multimodal AI models to resolve these discrepancies? Can modification of profile images using generative adversarial networks (GANs) effectively minimize or remove these discrepancies while maintaining the person's identity and genuineness?

The study is organized as follows: Sect. 2 surveys deep learning and GAN models for facial-expression synthesis and identifies the gaps this study addresses; Sect. 3 describes the methods for tweet sentiment extraction, facial expression synthesis and translation, and the model training process; Sect. 4 presents the datasets, including the custom 32-landmark dataset, along with model configurations and evaluation metrics, and compares the proposed model with several baseline models, highlighting key findings on AVI similarity. Finally, Sect. 5 concludes the study with a summary of the work, the challenges faced, and recommended future directions.

Related work

Deep learning has transformed facial expression recognition and sentiment analysis by allowing end-to-end systems to learn intricate feature hierarchies directly from raw data. This section reviews state-of-the-art research in deep learning for sentiment analysis and facial expression synthesis, explores GAN variants for image generation, and considers psychological perspectives on multimodal sentiment mismatches. Related work on recent facial expression synthesis research, identifying the gaps this study aims to address, is also included.

Works proposed in6 offer an extensive overview of deep facial expression recognition, emphasizing the shift from manually designed features to convolutional and transformer-based models. Recent research increasingly investigates multi-task and multimodal learning, integrating visual, textual, and audio inputs for enhanced emotional comprehension7,8. Few-shot transfer learning for sentiment analysis using facial expressions has been proposed9 to address low-data constraints. That framework integrates transfer learning and few-shot adaptation to allow precise sentiment inference from facial expressions even with few labeled samples, improving usability in real situations with limited sentiment–expression pairs. Such few-shot adaptation and domain generalization approaches provide a robust basis for sentiment-aware systems in data-scarce environments.

GANs, introduced in 201410,11,12, are deep learning models built on the interplay of a generative and an adversarial process. The model is trained with two networks: the generator and the discriminator. The generator produces images, and the discriminator predicts whether these images are real or fake. The training objective is for the generator to produce images that the discriminator recognizes as real. At each training iteration, image key features I(X) together with random noise (Z) are given as input to the generator, which outputs a fake image G(Z). The discriminator receives the real image X and the fake image G(Z); when it assigns a high probability to G(Z) being real, the generator has succeeded. There are many supporting papers on GANs and their variants12,13,14,15. Within a few years of the publication of GANs in 2014, the research community produced many GAN variants; a few of them are discussed in this section.
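As an illustration of this adversarial training loop, the following is a minimal PyTorch sketch. It assumes `generator` and `discriminator` are user-defined `nn.Module` instances with the usual interfaces (noise in, image out; image in, probability out); the names, noise dimension, and optimizers are illustrative placeholders, not the configuration used in this work.

```python
import torch
import torch.nn as nn

# Minimal adversarial training step for a vanilla GAN (illustrative sketch).
def gan_training_step(generator, discriminator, real_images,
                      opt_g, opt_d, noise_dim=100):
    bce = nn.BCELoss()
    batch = real_images.size(0)
    real_labels = torch.ones(batch, 1, device=real_images.device)
    fake_labels = torch.zeros(batch, 1, device=real_images.device)

    # 1) Update the discriminator on real images X and fakes G(Z).
    z = torch.randn(batch, noise_dim, device=real_images.device)
    fake_images = generator(z)
    d_loss = bce(discriminator(real_images), real_labels) + \
             bce(discriminator(fake_images.detach()), fake_labels)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Update the generator so the discriminator scores G(Z) as real.
    g_loss = bce(discriminator(fake_images), real_labels)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```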

CycleGANs are used to learn image transformations16,17,18, i.e., transformations from one image style to another. The primary goal of CycleGAN is to learn a mapping function G: I1→I2 such that the generated images G(I1) are indistinguishable from those in I2, with minimal adversarial loss as the achievable constraint. The approach learns not only the mapping G but also the inverse mapping F: I2→I1, and enforces a cycle consistency loss F(G(I1)) = I1. Conditional GAN (cGAN), which extends GANs to a conditional model, is studied in19. Mirza and Osindero trained GANs conditioned on domain-specific information, so that both the generator and the discriminator learn from additional inputs ranging from class labels to model specifications. Their approach showed how a simple vanilla GAN can be extended to a cGAN by adding the image class labels as additional information. Auxiliary classifier GAN (ACGAN)20, an extension of cGAN, trains on these classes so that the discriminator also outputs the class probability along with the probability that the image is fake. ControlGAN, the Controllable Generative Adversarial Network21, is a variant of ACGAN that uses data augmentation to address model overfitting; the work showed a trade-off between discrimination and classifier accuracy when augmentation is used in GANs, and used the Inception score metric to measure the generator's performance in producing new samples.

Advances in computer vision have made images far more interpretable and have made new image generation easier. Computer vision is used in many practical real-time applications, from creating realistic images to identifying morphed images. Face synthesis22,23,24 creates fake faces resembling realistic ones, with facial key features given as input to synthesize a target image. Image-to-image translation25,26 is another prominent computer vision application whose training goal is to map an input image to an output image. With the emergence of GANs and auto-encoders, tremendous progress has been made in designing deep neural network architectures for synthesizing photorealistic human faces. This section discusses a few of these architectures.

DA-GAN, a dual-agent GAN for face synthesis, is discussed in27. The work presents a GAN variant that preserves pose, texture, and face identity. The model uses a dual-agent discriminator, one agent for pose discrimination and the other for identity discrimination, and reports performance with new metrics such as pose perception loss and identity perception loss. The authors evaluated the possible loss of recognition via generation on the benchmark dataset IJB-A and showed superior results. TP-GAN, a two-pathway Generative Adversarial Network for photorealistic frontal view synthesis that simultaneously perceives global structures and local details, is discussed in28. Patch networks for varied landmarks are proposed to capture local textures in addition to the commonly used global encoder-decoder network. The work targets extracting a photorealistic and identity-preserving frontal view IF from a face image under a different pose, i.e. a profile image IP. To train such a network, pairs of corresponding {IF, IP} from multiple identities y are required during the training phase. Both the input IP and output IF come from a pixel space of size W × H × C with C color channels.

Facial landmarks are essential attributes of the face, representing key parts such as the eyes, mouth, jaw, and eyebrows. Their importance lies in preserving facial geometry, and they play a key role in face alignment, face shifting and swapping, head pose estimation, eye-blink detection, and much more. Many state-of-the-art approaches define algorithms29,30 for landmark detection from input images. These facial landmarks are structured, making it feasible to edit them and to generate target facial expressions31,32. Work presented in33 showed face synthesis from facial landmarks, and regression-based approaches34,35 estimate target landmarks for facial synthesis.

From a psychological view, verbal and nonverbal cues interact intricately in communication, where the alignment between words and expressions shapes how messages are perceived. Psychology has consistently demonstrated that facial expressions serve as strong indicators of hidden emotions, frequently holding greater significance than spoken words. When an image's expression conflicts with the sentiment of the accompanying text, it may cause cognitive dissonance, leading to decreased trust and perceived authenticity. Such inconsistencies may cause observers to doubt the communicator's motives, genuineness, or emotional condition.

Unmasking the Face36 is a groundbreaking study demonstrating that facial expressions serve as universal, physiological indicators of emotional conditions. The work showed that people unconsciously assess spoken or written communications alongside the related facial expressions, with inconsistencies frequently regarded as dishonest or disingenuous. The findings highlight the central role of non-verbal channels in shaping interpretation, even when textual sentiment is explicit. This principle directly applies to digital contexts, where profile pictures and written posts coexist as part of a unified communication channel. Building on this psychological foundation, work discussed in37 investigates fairness issues in multimodal emotion recognition systems. The authors demonstrate that inconsistencies between text and facial expressions can result in skewed predictions, especially when one modality dominates the model's decision-making, echoing cognitive bias and selective attention in human perception. Additionally, these mistakes can unevenly affect marginalized demographic groups, leading to disparities in perception and treatment, which psychology links to stereotype activation and social bias.

Research in media psychology highlights the significance of aligned visual and verbal cues in creating accurate perceptions. Studies in38 found that when a profile picture does not match the profile text, people are less likely to believe the profile is real, highlighting how inconsistency hurts trust in digital communication. These results support the psychological idea that mismatches between facial expressions and written words can alter how a person is perceived and damage their trustworthiness. Many areas of HCI have worked to understand how text and image interact in communication spaces and have revealed similar concerns. Furthermore, research in39 examined the impact of prominent features in images on emotions and the relationship between these emotional signals and textual sentiment, showing that visual prominence significantly influences emotional interpretation.

In recent years, facial landmark detection and its related applications, such as emotion recognition, emotion synthesis, and multimodal sentiment analysis, have attracted considerable interest owing to progress in deep learning and geometric feature representation. Research conducted between 2023 and 2025 has investigated various avenues, such as lightweight landmark detection, pose changes, and the incorporation of landmark-based features with generative models for synthesizing facial expressions. Parallel advancements have been noted in recommendation systems, integrating sentiment analysis with visual indicators for tailored experiences. Table 1 presents a unified comparison of selected recent studies, emphasizing their datasets, methods, areas of application, and the gaps identified in line with our proposed objectives.

Table 1 Comparative analysis of related work (2023–2025).

Although there has been notable advancement in multimodal sentiment analysis, current studies still show critical gaps: inadequate multimodal fusion strategies and limited empirical evidence, with experiments simulated using model-based metrics rather than image similarity metrics, as highlighted in Table 1. Existing GAN-driven facial animation techniques such as GANimation and X2Face are proficient at expression transfer but do not incorporate sentiment analysis frameworks to guarantee that the produced expressions are contextually aligned with the related text.

Observing these gaps, GAN-AVI has the following objectives: to develop a GAN-centric framework for altering the Twitter AVI so that its facial expression matches the tweet sentiment, and to evaluate the effectiveness of the proposed framework relative to current facial animation techniques (e.g., GANimation, X2Face) in terms of realism, emotional alignment, and computational efficiency.

Methodology

This section discusses the approaches used in our proposed work, from text sentiment extraction to face expression transformation. The proposed approach includes two frameworks: a text sentiment label extraction framework and a customized face expression synthesis framework.

Text sentiment label extraction framework: proposed \((L/K)\)-MCS-GAN

This section presents the sentiment label extraction framework, which takes advantage of word masking. State-of-the-art research has shown the use of masks to hide parts of the text while training the generator. Works such as CT-GAN44 and Info-GAN integrate categorical label information into the data being generated, and many of these models use this label as the target to be predicted, as discussed for the Multi-Crise-Cross Attention and StyleGANv2 Generative Adversarial Network (MCS-GAN)45. In our work we propose an \((L/K)\)-fold MCS-GAN to generate labeled data. In the generator module we use an \((L/K)\)-fold mask and a BiLSTM for encoding and decoding; for labeled text generation we use a classifier and a discriminator. Figure 1 shows the module architecture with generator and discriminator.

\((L/K)\)-fold mask

Word masking is widely used to improve generator prediction accuracy. Inspired by such masking, which has been shown to drastically reduce a neural network's overall accumulated error, we propose an \((L/K)\)-fold mask method for training the GAN model on text-based tweets.

Let Ti = (w1, w2, …, wn) be the ith tweet, containing n words. Construct a sequence of word positions P = (p1, p2, p3, …, pn), giving the position of each word in the tweet Ti. A randomizer picks a random sequence Q of (L/K) positions from P, given by Qj = (qj1, qj2, …, qjr), qjr \(\in\) {0, 1}, where j = L/K, L is the length of the tweet, and K is the desired fold. A stochastic approach generates the masks on the words of Qj: gjt = 0 means the word at position t, i.e. wt, is replaced with a mask token (\(\alpha\)), and gjt = 1 means the word is not masked.

$$g_{jt}=\begin{cases}0, & t\in (j-1)F\\ 1, & \text{otherwise}\end{cases}\qquad \text{for } j<K \text{ and } L/2<K<L$$
(1)

Where F = floor(L/K), K > 0.
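The masking procedure can be sketched as follows. This is a minimal illustration assuming whitespace tokenization and a literal `<MASK>` token; the function name `lk_fold_mask` is ours, not from the paper's implementation.

```python
import math
import random

def lk_fold_mask(tweet_words, K, mask_token="<MASK>"):
    """Randomly mask roughly L/K of the word positions in a tweet.

    tweet_words : list of words w1..wn (the tweet Ti)
    K           : fold parameter; F = floor(L/K) positions are masked
    """
    L = len(tweet_words)
    F = max(1, math.floor(L / K))                 # number of positions to mask
    positions = list(range(L))                    # position sequence P
    masked_positions = set(random.sample(positions, F))   # random subset Q

    masked = [mask_token if t in masked_positions else w  # g_t = 0 -> mask
              for t, w in enumerate(tweet_words)]
    return masked, masked_positions

# Example: mask a short tweet with K = 3 (about one third of the words hidden).
words = "deeply saddened by the tragic air crash today".split()
masked, idx = lk_fold_mask(words, K=3)
```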

The GAN generator module of our proposed approach uses a BiLSTM classifier operating as an encoder-decoder. For a tweet vector T = (w1, w2, …, wn) with true tweet label Tc and controllable information \(\ell_{c}\) for the true label, a masked tweet vector TM = (w′1, w′2, …, w′n) is generated. TM is encoded by the BiLSTM encoder as Z = (e1, e2, …, er), and Z is then decoded into a generated tweet vector T\(^{\circ}\) = (ῶ1, ῶ2, …, ῶn). The words in the tweet are predicted using the class-conditional probability function given by Eq. 2.

$$P(T^{\circ}\mid Z)=P(\tilde{\omega}_{1}\mid Z)\prod_{t=2}^{n}P(\tilde{\omega}_{t}\mid \tilde{\omega}_{t-1},Z)$$
(2)

The BiLSTM hidden units are updated using:

$$h_{t}=H\left(y_{t-1},\,h_{t-1},\,z\right)$$
(3)

Here yt−1 is the embedding vector of the previous word wt−2, fed as input to the tth stage of the BiLSTM.

Given below are the computational equations for modelling the BiLSTM:

$$I_{t}=\mathrm{sigmoid}\left(W_{i}X_{t}+U_{i}h_{t-1}+b_{i}\right)$$
(4)
$$F_{t}=\mathrm{sigmoid}\left(W_{f}X_{t}+U_{f}h_{t-1}+b_{f}\right)$$
(5)
$$C^{\prime}_{t}=\tanh\left(W_{c}X_{t}+U_{c}h_{t-1}+b_{c}\right)$$
(6)
$$C_{t}=I_{t}\ast C^{\prime}_{t}+F_{t}\ast C_{t-1}$$
(7)
$$O_{t}=\mathrm{sigmoid}\left(W_{o}X_{t}+U_{o}h_{t-1}+b_{o}\right)$$
(8)
$$h_{t}=O_{t}\ast \mathrm{relu}\left(C_{t}\right)$$
(9)

where It, Ft, Ct, and Ot are the input gate, forget gate, memory cell, and output gate of the BiLSTM at time t, and Wi, Wf, Wc, Wo and Ui, Uf, Uc, Uo are the weights applied to the input Xt and the hidden state ht−1 in the respective gates. Our proposed approach differs from a vanilla LSTM or BiLSTM in that the \((L/K)\)-fold masking of the tweet words Wn is concatenated to the input word embedding Xt. The embedding procedure of our approach is described in Algorithm 1.
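As an illustration of this encoder, the sketch below concatenates a binary mask indicator to each word embedding before a bidirectional LSTM. The class name `MaskedBiLSTMEncoder`, the vocabulary size, and the toy batch are illustrative assumptions, not the exact configuration of the paper.

```python
import torch
import torch.nn as nn

class MaskedBiLSTMEncoder(nn.Module):
    """BiLSTM encoder whose input is the word embedding X_t concatenated
    with a per-token flag from the (L/K)-fold mask (illustrative sketch)."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=192):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # +1 input feature for the binary mask indicator g_t
        self.bilstm = nn.LSTM(embed_dim + 1, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, token_ids, mask_flags):
        # token_ids : (batch, seq_len) word indices of the masked tweet T_M
        # mask_flags: (batch, seq_len) 1.0 if the word was kept, 0.0 if masked
        x = self.embed(token_ids)                         # (B, T, E)
        x = torch.cat([x, mask_flags.unsqueeze(-1)], -1)  # (B, T, E+1)
        outputs, _ = self.bilstm(x)                       # Z = (e1, ..., er)
        return outputs

# Example usage with a toy batch of two 6-token tweets.
enc = MaskedBiLSTMEncoder(vocab_size=10000)
ids = torch.randint(0, 10000, (2, 6))
flags = torch.ones(2, 6); flags[:, 2] = 0.0               # position 2 masked
z = enc(ids, flags)                                        # shape (2, 6, 384)
```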

Algorithm 1
figure a

Embedding procedure in generator.

From the sequence of word tokens, the generator selects the first token tg and generates the output tweet sentence Gg. This sentence is then given to the discriminator and classifier for a reward. The maximum expected reward gained by the generator is given by:

$$J(\Delta_{g})=E\left(R_{s}\mid W_{o},\ell_{c},\Delta_{g}\right)=\sum_{d_{g}\in W}G_{\Delta_{g}}\left(d_{g}\mid W_{o},\ell_{c}\right)P\left(\Delta_{d},\Delta_{c}\right)$$
(10)

Where Rs is the reward of the tweet sentence, Wo is the start word, \(\ell_{c}\) is the controllable information, \(\Delta_{g}\), \(\Delta_{d}\), \(\Delta_{c}\) are the generator, discriminator, and classifier parameters respectively, dg represents the generated token, and P(\(\Delta_{d}\), \(\Delta_{c}\)) is an action-value function given by:

$$P\left(\Delta_{d},\Delta_{c}\right)=\frac{2\,D\left(\Delta_{d}\right)D\left(\Delta_{c}\right)}{D\left(\Delta_{d}\right)+D\left(\Delta_{c}\right)}$$
(11)

Where D(\(\:\varDelta\:\text{d})\:\)represents the probability of a sentence being real, D(\(\:\varDelta\:\text{c})\) is the probability of the sentence with right label.

From Eqs. (10) and (11), the parameter update of the generator is derived as:

$$\Delta_{g}=\Delta_{g}+\alpha_{t}\,\nabla J\left(\Delta_{g}\right)$$
(12)

Where \(\alpha_{t}\) is the stepwise learning rate.
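A compact sketch of this reward-weighted generator update, in the spirit of Eqs. (10)–(12), is given below. The reward combines the discriminator and classifier scores as in Eq. (11); the function names and the sampling interface (per-token log-probabilities with gradients attached) are hypothetical placeholders, not the paper's exact training code.

```python
import torch

def harmonic_reward(p_real, p_label):
    """Action value of Eq. (11): combination of the discriminator score
    P(real) and the classifier score P(correct label) per sentence."""
    return (2 * p_real * p_label) / (p_real + p_label + 1e-8)

def generator_policy_update(log_probs, p_real, p_label, opt_g):
    """REINFORCE-style update (Eqs. (10) and (12)): raise the log-likelihood
    of generated token sequences in proportion to their reward.
    log_probs: (batch, seq_len) log-probs of sampled tokens, requires grad."""
    reward = harmonic_reward(p_real, p_label).detach()   # (batch,)
    loss = -(log_probs.sum(dim=1) * reward).mean()       # maximize E[R log G]
    opt_g.zero_grad(); loss.backward(); opt_g.step()
    return loss.item()
```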

Fig. 1
figure 1

Sentiment label extraction framework.

In the discriminator module we use a BiLSTM classifier. Since the GAN model is trained on a true tweet sentence with controllable information \(\ell_{c}\) as well as on a generated sentence, we define our GAN loss function as the linear sum of the two classifier losses, given by:

$$L^{\prime}_{c}=\mathrm{Loss}\big(L\left(t_{r}:\Delta\,\ell_{c}\right),\,T_{c}\big)+\mathrm{Loss}\big(L\left(G\left(Z:\Delta_{g}\right):\Delta\,\ell_{c}\right),\,o\left(\ell_{c}\right)\big)$$
(13)

Here the first term on the right-hand side is the true-sentence loss and the second term is the generated-sentence loss. In this notation, L(.) is the loss from the classifier with three categorical labels (Positive, Negative, and Neutral), o(\(\ell_{c}\)) is the controllable label-generation linear function, \(\Delta_{g}\) and \(\Delta_{c}\) are the generator and classifier parameters, Tc is the true data label, tr is the real token, and \(\epsilon\)() is the expectation. With the classifier loss so defined by Eq. (13), our GAN discriminator loss can be described as:

$$\mathcal{L}oss_{D}=L\epsilon_{d_{g}\to(c,z)}+L\epsilon_{d_{r}\to D}$$
$$L\epsilon_{d_{g}\to(c,z)}=\epsilon_{d_{g}\to(c,z)}\left[-\log\left(1-P\left(G\left(o(c),z,\Delta_{g}\right)\right),\Delta_{d}\right)\right]$$
$$L\epsilon_{d_{r}\to D}=\epsilon_{d_{r}\to D}\left[-\log\left(P\left(T_{c},t_{r},\Delta_{d}\right)\right)\right]$$
(14)

with generated token dg, real token tr, controllable information \(\ell_{c}\), real data label Tc, and P the true data distribution. The GAN model so defined is trained, inducing a uniform probability distribution, to generate the sequence of tokens that forms the whole labelled tweet sentence. Figure 1 presents the sentiment extraction framework, which is trained to label a piece of text with one of three basic sentiments (Tc): (1) Positive, (2) Negative, (3) Neutral.
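The discriminator and classifier losses of Eqs. (13)–(14) can be written as batch losses roughly as follows; this is a minimal sketch under the assumption that the discriminator outputs probabilities and the classifier outputs three-class logits, with function names of our choosing.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_real, d_fake):
    """Eq. (14): real tokens should score as real, generated tokens as fake.
    d_real, d_fake: probabilities in (0, 1) from the discriminator."""
    loss_real = -torch.log(d_real + 1e-8).mean()
    loss_fake = -torch.log(1.0 - d_fake + 1e-8).mean()
    return loss_real + loss_fake

def classifier_loss(logits_real, labels_real, logits_fake, labels_target):
    """Eq. (13): cross-entropy on the true sentence against its true label,
    plus cross-entropy on the generated sentence against its controlled label."""
    return F.cross_entropy(logits_real, labels_real) + \
           F.cross_entropy(logits_fake, labels_target)
```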

Fig. 2
figure 2

Facial expression synthesis framework.

Customized face expression synthesis framework

This section presents how facial expression synthesis is done with the custom 32 landmarks. Since our approach relies on three facial sentiments, Happy, Sad, and Neutral, we take only 32 custom landmarks corresponding to the mouth and eye facial components, whereas state-of-the-art facial component detection approaches46,47 typically use 42 to 68 landmarks.

The OpenFace Dlib tool was used to extract the 32 landmarks. These landmarks are customized by changing the landmark locations (the changed distances are discussed in Table 2) so as to reenact facial expressions analogous to the tweet sentiment extracted in Sect. 3.1. OpenFace Dlib provides a facial landmark model that detects 68 face points, as shown in Table 3; from these 68 points we chose the 32 landmark points related to the eyes and mouth.

Table 2 Target landmarks mappings.
Table 3 OpenFace Dlib landmark points.
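The 32-point extraction described above can be sketched with dlib as follows. The predictor file is the standard public 68-point model, and the eye (36–47) and mouth (48–67) index ranges follow the common dlib convention, which together give 32 points; that this exactly matches the paper's custom selection is an assumption.

```python
import dlib
import cv2

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")
EYE_MOUTH_IDX = list(range(36, 68))   # 12 eye points + 20 mouth points = 32

def extract_32_landmarks(image_path):
    """Return the 32 eye/mouth (x, y) landmarks of the first detected face."""
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])          # 68 (x, y) points
    return [(shape.part(i).x, shape.part(i).y) for i in EYE_MOUTH_IDX]

landmarks = extract_32_landmarks("avi.jpg")    # list of 32 (x, y) tuples
```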

Proposed GAN-AVI landmark and facial expression synthesis approach

The facial expression synthesis framework is illustrated in Fig. 2. The input is the user's AVI, and the OpenFace landmark generator, as discussed in48,49,50, is used to detect and extract the key component landmarks of the input image. These extracted landmarks are then used to change the AVI expression analogously to the text sentiment. Each face image is represented by a set of k landmarks identifying the key facial components such as the mouth and eyes. The (x, y) coordinate representation of these landmarks is given by:

$$K=\left[\left(l_{x}^{1},l_{y}^{1}\right),\left(l_{x}^{2},l_{y}^{2}\right),\left(l_{x}^{3},l_{y}^{3}\right),\ldots,\left(l_{x}^{k},l_{y}^{k}\right)\right]$$
(15)

Identity-preserving steps such as scaling, rotation, and translation are conducted to align the key components in their key areas, e.g., the eyes located at the same position and the mouth located below the nose. Based on the required target sentiment category, Positive (P), Negative (N), or Neutral (Ne), we split the landmarks into three subsets, as shown in Table 4.

Table 4 Landmark subsets.

AVI-Smile (F1): the landmark set required for changing the profile AVI to an AVI depicting positive sentiment. Since the eyes and mouth are the key facial components conveying a smile, this set includes the mouth and eye landmarks.

AVI-Sad (F2): the landmark set required for changing the profile AVI to an AVI depicting negative sentiment. It includes the mouth and eye landmarks.

AVI-Neu (F3): the landmark set required for changing the profile AVI to an AVI depicting neutral sentiment. It includes the mouth and eye landmarks.

The mapping F1→F2 denotes conversion from a smile expression to a sad expression, and F2→F1 from sad to smile. The 32 landmarks extracted from the AVI image as (x, y) coordinates are first grouped into the three subsets discussed above, and each subset is converted into a vector. These vectors are given as input to our landmark converter for facial expression conversion matching the tweet sentiment from Sect. 3.1, so as to arrive at the target landmark. To prepare the target landmark image we use reduced Euclidean distances, as shown in Table 2. The goal of the proposed GAN-AVI is to learn a mapping G: z→y, where z is random noise and y is the target output image, and the generator must produce fake images that the discriminator identifies as real. The objective of the GAN deep learning model is stated in Eq. 16.

$$L_{\mathrm{GAN}}\left(G,D\right)=E_{y}\left[\log\left(D\left(y\right)\right)\right]+E_{z}\left[\log\left(1-D\left(G\left(z\right)\right)\right)\right]$$
(16)

Equation (16) represents the adversarial objective employed to train the GAN conditioned on landmarks, where the generator G aims to create realistic facial images based on landmark inputs while the discriminator D strives to differentiate real samples from generated ones. For each real image y in the training set, D(y) measures the probability of the input image being real. The first term of Eq. 16, Ey[log(D(y))], drives D to assign high confidence to real samples. For each random noise vector z with landmark conditioning, the generator produces a synthetic image G(z). The second term, Ez[log(1 − D(G(z)))], pushes G to create images that successfully fool the discriminator. During training, mini-batches of real landmark-image pairs are provided to the discriminator in addition to landmark-conditioned synthetic images generated by G. This process is repeated across the complete dataset, with the loss values calculated per batch to estimate the expectations in Eq. (16).

In our proposed approach the target image to be generated is conditioned on the AVI image, so in the training process a conditional GAN (cGAN) is preferred over a plain GAN. Its learning is therefore mapped with a conditional input, G: [x, z]→y, where x is the condition learned during the training phase. Equation 16, updated with the cGAN objective incorporating x, becomes:

$$L_{\mathrm{CGAN}}\left(G,D\right)=E_{x,y}\left[\log\left(D\left(x,y\right)\right)\right]+E_{x,z}\left[\log\left(1-D\left(x,G\left(x,z\right)\right)\right)\right]$$
(17)

The solution to (17) can be thought of as D attempting to maximize (17) while G tries to minimize it. The solution is thus expressed as:

$$G_{\mathrm{sol}}=\min_{G}\max_{D}\left[L_{\mathrm{CGAN}}\left(G,D\right)\right]$$
(18)
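The conditional objective of Eqs. (17)–(18) can be written as per-batch losses roughly as follows. This is a sketch under our own assumptions: the generator is assumed to take (condition, noise) and the discriminator a channel-wise concatenation of the condition and an image; this is an implementation choice for illustration, not necessarily the paper's.

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()

def cgan_losses(G, D, x, y, z):
    """Batch losses for the conditional objective of Eq. (17).
    x: condition (input AVI), y: real target image, z: noise."""
    fake = G(x, z)                                       # G(x, z)
    d_real = D(torch.cat([x, y], dim=1))                 # D(x, y)
    d_fake = D(torch.cat([x, fake.detach()], dim=1))     # D(x, G(x, z))

    # Discriminator: real pairs -> 1, generated pairs -> 0.
    d_loss = bce(d_real, torch.ones_like(d_real)) + \
             bce(d_fake, torch.zeros_like(d_fake))
    # Generator: push D(x, G(x, z)) toward 1 (fool the discriminator).
    g_loss = bce(D(torch.cat([x, fake], dim=1)), torch.ones_like(d_real))
    return d_loss, g_loss
```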

As shown in Fig. 2, the facial expression synthesis framework includes three stages: generation of the input AVI facial sentiment label; preparation of the target landmark; and synthesis of the target facial expression from the target landmark.

Stage 1 generates the input AVI facial sentiment label (le). Since our approach uses only one input image, the user AVI, we used the OpenFace 2.0 tool to analyze the facial behavior in the AVI and to generate the input AVI label (le). OpenFace 2.0 is a computer vision framework with algorithmic modules for facial behavior analysis, facial landmark extraction, pose estimation, and facial action detection. Issues from non-frontal and occluded faces are addressed by the tool using varied convolutional layers. Because at this stage we have only a single image, we used this pretrained model, which has good real-time performance and does not require a GPU, to attach the expression label to the AVI.

Stage 2 prepares the target landmark (lT). We fed the user AVI to the Dlib model to extract the 68 landmarks, from which 32 are chosen. Stage 1 has generated an expression label le for the input AVI. Since the key facial features for analyzing facial sentiment are the eyes and mouth, we cut the jawline and other landmarks and ignored the nose landmarks. Our proposed landmark detector considers the 32 custom landmarks shown in Table 5. Our approach calculates the Euclidean distances between the corresponding landmarks of the user AVI supporting the expression category (Fi); based on the required target facial expression (Fj), it reduces or increases the respective Euclidean distances between the landmarks. We prepared the target landmark (lT) according to the extracted tweet sentiment label (Tc) from Sect. 3.1. A sample of the target landmark mapping function, lT: Fi→Fj, is shown in Table 2.

Table 5 GAN-AVI customized landmarks.

We generated the target landmark image by reducing the distances between the AVI landmarks to meet the output sentiment from Sect. 3.1. As an example, if the AVI's sentiment category is F1 (smile) and the tweet has negative sentiment, we follow the target landmark mapping lT: F1→F2, i.e. from smile to sad with i = 1 and j = 2; dr(i, j) are the initial landmark distances of the AVI across the facial components, while d1r(i, j) are the reduced landmark distances. Model experiments were run with distance reductions ranging from 20 to 70% on the various facial component landmarks, with different reductions applied to different components such as the eyes and lips, as shown in Table 2.
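A minimal numpy sketch of this distance-reduction step is shown below: each component's landmarks are contracted toward their centroid by a given percentage, which is our geometric interpretation of "reducing the Euclidean distances". The component index groups and the percentages (loosely following Table 2) are illustrative assumptions.

```python
import numpy as np

def shrink_component(landmarks, idx, reduction):
    """Contract one facial component's landmarks toward their centroid,
    reducing pairwise distances by `reduction` (e.g. 0.5 = 50%)."""
    pts = landmarks[idx]
    centroid = pts.mean(axis=0)
    landmarks[idx] = centroid + (pts - centroid) * (1.0 - reduction)
    return landmarks

def prepare_target_landmarks(landmarks_32):
    """Illustrative F1 -> F2 (smile -> sad) target-landmark preparation."""
    lm = np.asarray(landmarks_32, dtype=float).copy()    # (32, 2) (x, y)
    groups = {                                            # assumed index layout
        "right_eye": list(range(0, 6)),  "left_eye": list(range(6, 12)),
        "outer_lip": list(range(12, 24)), "inner_lip": list(range(24, 32)),
    }
    lm = shrink_component(lm, groups["right_eye"], 0.20)
    lm = shrink_component(lm, groups["left_eye"], 0.20)
    lm = shrink_component(lm, groups["outer_lip"], 0.50)
    lm = shrink_component(lm, groups["inner_lip"], 0.90)
    return lm
```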

The 32 landmarks play a key role in synthesizing target facial expressions. They serve as reference points for generating two key facial expressions: smile and sad. In translating the sentiment from text into a target expression, stage 2 modifies the placement of these landmarks based on the intended emotional configuration (e.g., raising mouth corners for joy, lowering eyebrows for anger).

Stage 3 synthesizes the target facial expression (x1) from the target landmark (lT). The input to this module is a labeled image, denoted x, with expression label le, together with the target landmark lT = \(\sum_{i=1}^{32}(p_{i},q_{i})\). We use a landmark-conditional GAN (LCGAN) to generate the target AVI (x1) facial expression conditioned on the target landmark lT. The translation mapping can be formulated as:

$$G_{l}\left[x,z_{1},l_{e}\right]\rightarrow l_{T}:G_{l}\left[x^{1},z_{1},F_{j}\right]$$
(19)

where z1 is the random noise, x is the input AVI, x1 is the target AVI, and Fj is the encoded expression obtained by reducing the Euclidean distances from Table 2. We used the LCGAN embedding model51 to train on these reduced Euclidean distances of the target AVI. The LCGAN model preserves and reconstructs the key facial expressions and color from x to x1, with Fj the encoded vector of expressions. At this stage the objective of our proposed GAN-AVI is:

$$L_{\mathrm{CGAN}}\left(G_{l},D_{l}\right)=E_{l_{T},x^{1}}\left[\log\left(D_{l}\left(l_{T},x^{1}\right)\right)\right]+E_{x,z_{1}}\left[\log\left(1-D_{l}\left(l_{T},G\left(l_{T},z_{1}\right)\right)\right)\right]$$
(20)

The reconstruction loss of the landmarks is constructed on the sequence of (x, y) = (p, q) coordinates of the landmarks (LM) and is given by:

$$L_{LM}\left(G_{l}\right)=E_{l,l_{T}}\left[\frac{1}{32}\sum_{i=1}^{32}\sqrt{d_{i}}\right],\quad \text{where } d_{i}=\left(p_{i}-\mathrm{pred}\left(p_{i}\right)\right)^{2}+\left(q_{i}-\mathrm{pred}\left(q_{i}\right)\right)^{2}$$
(21)

Where x1 = Gl[x, z1, Fj] = \(\sum_{i=1}^{32}\left(\mathrm{pred}\left(p_{i}\right),\mathrm{pred}\left(q_{i}\right)\right)\).
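This landmark reconstruction term can be computed, per batch, roughly as in the sketch below (a minimal illustration in the spirit of Eq. (21); the function name is ours).

```python
import torch

def landmark_reconstruction_loss(true_pts, pred_pts):
    """Mean Euclidean distance between the 32 target and predicted (p, q)
    landmark coordinates. true_pts, pred_pts: tensors of shape (batch, 32, 2)."""
    dist = torch.sqrt(((true_pts - pred_pts) ** 2).sum(dim=-1) + 1e-8)
    return dist.mean()
```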

Algorithm 2
figure b

Proposed GAN-AVI algorithm.

Algorithm 2 shows the steps of the proposed GAN-AVI method. In the architecture of Fig. 3, the extracted tweet is given as input to the text sentiment label extraction framework discussed in Sect. 3 so as to assign a sentiment label (Tc) to the tweet; for the sample tweet shown in Fig. 2, the method assigns Tc = Negative. The user AVI is given as input to the customized face expression synthesis framework to first extract the AVI facial sentiment label (le); for the sample AVI in Fig. 2, le = Positive (smile). The same AVI is passed to the DLIB library to extract the custom 32 landmarks of the mouth and eyes. These landmarks are then modified to obtain reenacted face landmarks matching the tweet sentiment Tc = Negative (sad), i.e. we apply the transformation mapping F1→F2, from smile to sad: the landmarks showing the smile expression are modified to show sadness. This modification is done by reducing the landmark distances of the left and right eye boundaries by 20%, the inner lip landmark distances by 90%, and the outer lip landmarks by 50%, as given under the target landmark mappings of Table 2. We have thus prepared the target landmark (lT), analogous to the tweet sentiment Tc = Negative (sad). From this target landmark the face is then synthesized; for this our approach uses the LCGAN conditioned on the target landmark (lT), trained to meet the objective functions given in Eqs. 20 and 21.

Fig. 3
figure 3

GAN-AVI architecture.

Experimental evaluations

Datasets and training

The effectiveness of our proposed model was evaluated by testing and validating on different publicly available datasets. Tweet data was used to train the text sentiment extraction framework. The built model was trained on publicly available tweet data, a total of 5000 public tweets scraped from GitHub, to understand tweet sentiment. In addition, around 5000 tweets on accidents were scraped using the Tweet API. We used a BERT model to assign sentiment labels to the data, creating a labeled tweet dataset of 10,000 rows, referenced at52. The text sentiment extraction framework was trained on this dataset to learn to assign a sentiment label to a single tweet.

Image data was used to train the face expression synthesis framework, which uses the dlib landmark extractor to extract the 32 target landmarks. To extract these custom landmarks the framework was first trained on three different datasets: (1) WFLW, (2) FER2013, and (3) RAFD. WFLW, available at53, is a database of about 10,000 face images annotated with around 98 facial landmarks and is a widely used benchmark for facial expression analysis. For the landmark synthesis module we trained the model on the publicly available FER2013 and RAFD datasets. FER2013 is an image dataset54 of nearly 35,000 facial images categorized under expressions such as anger, disgust, happy, and sad; our approach took only three categories (sad, happy, and neutral), giving around 18,000 images. The RAFD55 dataset has around 15,000 facial images with various emotion expressions, from which we took only 3000 images of the three categories sad, happy, and neutral.

To classify the sentiment category, we preferred the landmarks of the mouth and eyes. From the trained face expression synthesis framework, the 32 custom landmarks of the eyes and mouth were extracted into a custom dataset with expression labels; the dataset has 32 landmark features and 2500 records and is available at56. To extract these 32 landmarks we used OpenFace packages such as dlib and OpenCV. For each dataset we used a 65:35 train/test split, and for the GAN implementation all images were rescaled to 256 × 256.

The proposed method was fine-tuned by optimizing various parameters for each component. All experiments were conducted using PyTorch with NVIDIA GPU support to guarantee effective training. The text sentiment label extraction module was trained with a batch size of 64, a single 128-dimensional embedding layer, one hidden layer of dimension 192, and a learning rate of 2 × 10⁻³; the BiLSTM was fine-tuned for between 1 and 15 epochs. The discriminator (D) loss, generator (G) loss, and validation accuracy of this module for various epochs are shown in Table 6.

Table 6 Discriminator/generator loss and validation accuracy.

For facial synthesis from the target landmark we conducted another set of experiments, in which we custom-changed the landmarks corresponding to the eyes and mouth, as given in Table 2, to meet the required sentiment label. We further generated faces with extreme changes in the distances between landmarks to obtain the required output. We used settings of 20 to 70% distance reduction so that landmarks could be moved closer together or further apart, mapping to the sentiment labels sad and happy.

Evaluation metrics

Our approach employs quantitative metrics: average error distance (AED), Structural Similarity Index (SSIM), landmark difference (LMK), and identity difference (ID) between the target and synthesized faces. LMK aims to preserve the input image landmarks and focuses on evaluating the fidelity of the synthesized images. We used AED as the metric for landmark detection accuracy: the landmark difference is calculated by first normalizing the landmarks and then computing the error between the input face and the synthesized face, as given by Eq. 22. The normalization uses the inter-ocular distance (IOD), the distance between the two eye centers.

$$e=\frac{1}{M}\sum_{i=1}^{M}\frac{\sum_{j=1}^{N}\sqrt{\left(p_{j}-\tilde{p}_{j}\right)^{2}+\left(q_{j}-\tilde{q}_{j}\right)^{2}}}{N\,\left\|le_{i}-re_{i}\right\|_{2}}$$
(22)

with M test landmark images and N landmarks (N = 32 in our experiments); (pj, qj) are the landmark coordinates of the ground-truth image, (p̃j, q̃j) are the predicted landmark coordinates, and lei and rei are the left and right eye center landmarks of the ith test sample.
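A minimal numpy sketch of this normalized error (in the spirit of Eq. (22)) is given below; array shapes and the function name are our own conventions.

```python
import numpy as np

def average_error_distance(gt, pred, left_eye, right_eye):
    """Per-image mean landmark error normalized by the inter-ocular distance,
    averaged over M test images.
    gt, pred: (M, N, 2) landmark arrays; left_eye, right_eye: (M, 2) eye centers."""
    per_point = np.linalg.norm(gt - pred, axis=-1)          # (M, N) distances
    iod = np.linalg.norm(left_eye - right_eye, axis=-1)     # (M,) inter-ocular
    per_image = per_point.mean(axis=1) / iod                # normalized error
    return per_image.mean()
```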

SSIM is a metric frequently used to assess image quality; it measures the similarity between the source image and the synthesized image and is usually computed over pixel windows. It is not a distance function in general but satisfies symmetry over pixels. SSIM combines three individual components: luminance (L), contrast (C), and structure (S).

$$\mathrm{SSIM}(x,y)=\left[\frac{2\mu_{x}\mu_{y}+C_{1}}{\mu_{x}^{2}+\mu_{y}^{2}+C_{1}}\right]^{\alpha}\left[\frac{2\sigma_{x}\sigma_{y}+C_{2}}{\sigma_{x}^{2}+\sigma_{y}^{2}+C_{2}}\right]^{\beta}\left[\frac{\sigma_{xy}+C_{3}}{\sigma_{x}\sigma_{y}+C_{3}}\right]^{\gamma}$$
(23)

x and y are the two images to be compared, with means \(\mu_{x},\mu_{y}\), standard deviations \(\sigma_{x},\sigma_{y}\), and covariance \(\sigma_{xy}\).

The first SSIM term compares luminance on a [0, 1] scale and judges the similarity \(\mu_{x}=\mu_{y}\), with a best value of 1. The second term, using the standard deviation \(\sigma\), compares the image contrasts on a [0, 1] scale and judges the variance similarity \(\sigma_{x}=\sigma_{y}\), with a best value of 1. The third term, the structure comparator, measures the correlation between the two images on a [-1, 1] scale, with a best value of 1. The relative importance of the three components is controlled by the weights \(\alpha>0\), \(\beta>0\), \(\gamma>0\).

As our approach uses custom-modified landmarks of the eyes and mouth, we used the cross-correlation metric r*(x, y) to evaluate the correlations underlying the SSIM components: luminance, contrast, and structure.

$$\mathrm{SSIM}\ r^{*}L(x,y)=\frac{\sigma_{xy}}{\sigma_{x}\sigma_{y}},\qquad \mathrm{SSIM}\ r^{*}C(x,y)=\frac{\sigma_{xy}}{\sigma_{x}\sigma_{y}},\qquad \mathrm{SSIM}\ r^{*}S(x,y)=\frac{\sigma_{xy}}{\sigma_{x}\sigma_{y}}$$
(24)

Where r*L(x, y) is the luminance correlation between the input and target images, r*C(x, y) is the contrast correlation, and r*S(x, y) is the structure correlation.

In practice SSIM adopts a sliding window to evaluate these metrics. For our custom reduced landmark distances we used a 9 × 9 sliding window to capture the componential correlations. The pooled SSIM over n windows is given by

$$\mathrm{SSIM}_{\mathrm{pooled}}=\frac{1}{n}\sum_{i=1}^{n}\mathrm{SSIM}\left(x_{i},y_{i}\right)$$
(25)
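Pooled SSIM with a 9 × 9 window can be computed, for instance, with scikit-image, which averages SSIM over sliding windows. This is a convenience sketch for illustration, not the paper's implementation.

```python
from skimage.metrics import structural_similarity as ssim
import cv2

def pooled_ssim(source_path, synthesized_path, win_size=9):
    """Mean SSIM over 9x9 sliding windows between two grayscale images."""
    x = cv2.imread(source_path, cv2.IMREAD_GRAYSCALE)
    y = cv2.imread(synthesized_path, cv2.IMREAD_GRAYSCALE)
    y = cv2.resize(y, (x.shape[1], x.shape[0]))        # match image sizes
    return ssim(x, y, win_size=win_size)
```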

Table 7 shows the quantitative evaluation of the pooled luminance (L), contrast (C), and structure (S) metrics, evaluated for different distance reductions. The table reports pooled metric values for a sample of reduced distances between the right eye boundary (C1), left eye boundary (C2), lip inner line (C3), and lip outer line (C4), for the mapping lT: F1→F2.

Table 7 Pooled SSIM values for varied distances.
Fig. 4
figure 4

SSIM r*S(x, y), structure similarity values of varied reduced distances.

Figure 4 shows the structure similarity values for various reduced distances, indicating the extent to which the structure of the eye and lip components is preserved by our approach. The structure index is retained best when the right and left eye boundary distances are reduced by 35–45% and the lip outer- and inner-line boundary distances are reduced by 40–55%.

Baseline comparison

We compared our approach against other baseline landmark detection and facial synthesis models: GANimation57, X2Face58, AttentionGAN59, and C2GAN60. Table 8 shows the parameters of the proposed model in comparison with those of the baseline models.

Table 8 Proposed model parameters in comparison to baseline models.

The GANimation model uses target action units (TAUs), categories of facial local components based on muscle movements; a lip-stretch muscle movement, for example, may indicate fear. These TAUs guide the target image generation process. X2Face is a self-supervised model used to transfer expression from a source image to a target image. AttentionGAN is a GAN with a guided attention mechanism that can perceive the most discriminative components between the source and the target, favoring minimal background changes around the foreground object. C2GAN is a cycle-in-cycle GAN, a cross-modal framework with two generators, one landmark-keypoint oriented and the other image oriented, each cyclically connected with the aim of reconstructing the input; SSIM is the metric we focused on most. Table 9 shows the error of the baseline models computed on the test samples using the metrics AED, SSIM, LMK, and ID.

Table 9 Error of proposed method in comparison to baseline models.

Our approach studied the error given in Eq. 22 for different test samples on three landmark sets of 68, 51, and 32 points. Our AVI method with 32 landmarks performed on par with the baseline methods using 68 and 51 landmarks. Figure 5 shows the error observed over the various test samples using the proposed approach, which reaches a lowest error of 0.2%, demonstrating its enhanced accuracy in landmark localization for micro-expression generation tasks.

Fig. 5
figure 5

Error observed for varied landmarks.

The four baseline models, GANimation, X2Face, AttentionGAN, and C2GAN, belong to the wider category of Generative Adversarial Networks commonly used in conditional image synthesis tasks such as manipulating facial expressions, reenacting faces, and generating images from landmarks. Although each architecture uses distinct conditioning methods, they all face the shared difficulty of maintaining structural fidelity while generating new faces. A single unified metric that can be applied across these varied GAN frameworks is the Normalized Cross-Correlation (NCC). NCC measures the similarity between generated and target images regardless of overall brightness or contrast, making it well suited to face synthesis tasks. By normalizing pixel intensities and emphasizing correlation patterns, NCC offers a reliable assessment of how accurately each model preserves local facial features such as the eyes, nose, and mouth relative to the target images. The normalized cross-correlation between two image regions X and Y is given by Eq. 26.

$$\mathrm{NCC}(X,Y)=\frac{\sum_{i=1}^{N}\left(X_{i}-\bar{X}\right)\left(Y_{i}-\bar{Y}\right)}{\sqrt{\sum_{i=1}^{N}\left(X_{i}-\bar{X}\right)^{2}}\sqrt{\sum_{i=1}^{N}\left(Y_{i}-\bar{Y}\right)^{2}}}$$
(26)

where Xi is the pixel intensity of the ith pixel in the target region, Yi is the pixel intensity of the ith pixel in the synthesized region, \(\bar{X}\) and \(\bar{Y}\) are the mean pixel intensities of regions X and Y, and N is the number of pixels in the region. The error based on NCC is given by Eq. 27.

$${\text{Error}}\left( {{\text{NCC}}} \right)\,=\,{\text{1}} - {\text{NCC}}$$
(27)
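Equations (26)–(27) can be computed directly as in the short numpy sketch below (function names are ours).

```python
import numpy as np

def ncc(target, synthesized):
    """Normalized cross-correlation of Eq. (26) between two image regions."""
    x = target.astype(float).ravel() - target.mean()
    y = synthesized.astype(float).ravel() - synthesized.mean()
    return float((x * y).sum() /
                 (np.sqrt((x ** 2).sum()) * np.sqrt((y ** 2).sum()) + 1e-8))

def ncc_error(target, synthesized):
    """NCC error of Eq. (27)."""
    return 1.0 - ncc(target, synthesized)
```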

Table 10 shows the mean NCC and NCC error over the test samples for the proposed and baseline models.

Table 10 Normalized cross correlation error.
Fig. 6
figure 6

NCC and error observed for baseline models.

The proposed model (GAN-AVI) reaches the highest mean NCC (0.93), as shown in Fig. 6, alongside the lowest error rate (0.07), clearly showing its enhanced capability to maintain the correlation between the produced and actual facial expressions compared to the baseline models.

Conclusions and future work

There are many observed cases in which the expression in a user's Twitter AVI does not match the user's tweet sentiment, and this may lead to a different opinion of the tweeter. To handle such instances, a method is needed that understands the tweet sentiment and dynamically changes the profile AVI. In this work we presented a novel method, GAN-AVI, that understands the tweet sentiment and changes the user AVI analogously to it. Our approach uses two frameworks: one to extract the sentiment from the tweet and the other to transform the user AVI into the target AVI by synthesizing the required facial expression. When synthesizing the target face, our approach uses reduced Euclidean distances to change facial components such as the mouth and eyes. Experiments were run on the tweet dataset and on image datasets from FER2013 and RAFD. The performance of the approach was evaluated using the metrics AED, SSIM, LMK, and ID, with a particular focus on the SSIM structure metric. Baseline models, GANimation, X2Face, AttentionGAN, and C2GAN, were used for comparison to demonstrate the effectiveness of our approach, with their performance evaluated using the NCC metric.

Developing a facial expression translator based on tweet sentiment poses several technical and conceptual challenges. Effectively aligning text-based sentiment with facial expressions necessitates strong natural language processing (NLP) models that can grasp context, sarcasm, cultural subtleties, and underlying emotions. Training models to generate or select the correct facial expression and fetch the landmarks involves handling subtle variations, micro-expressions, intensity levels, and symmetry, which are challenging to detect and render realistically. Additionally, there is a danger of reinforcing biases if the system misreads expressions for specific demographic categories, resulting in ethical concerns and trust problems.

The dataset poses multiple obstacles that may influence the precision and reliability of a text sentiment-based expression translator. To begin with, the image quality differs in aspects like lighting, resolution, and background noise, potentially leading to inconsistencies in landmark detection. Facial obstructions, like hair, eyeglasses, or hand movements, decrease the dependability of key point detection.

Future studies might aim to address these challenges using cutting-edge deep learning models, such as transformer-based vision architectures, to enhance feature extraction for intricate emotional states. Investigating the temporal dynamics of facial movements through video sequences, rather than static images, may assist in identifying micro-expressions and transitional emotional states. Such studies could also improve the accuracy of facial landmark detection by incorporating additional data sources such as depth data, infrared images, or thermal visuals, which could help identify subtle muscle movements typically overlooked by 2D evaluations.

From a psychological standpoint, future research might explore how particular sets of facial landmarks, particularly near the eyes, mouth, and eyebrows, relate to subtle emotional conditions like contempt, embarrassment, or blended feelings. This creates possibilities for developing AI systems that are both precise in emotion categorization and attuned to cultural and contextual differences in facial expressions. Additionally, combining landmark-based emotion recognition with psychological profiling may assist in mental health assessments by identifying early indicators of stress, anxiety, or depression from facial expressions in real-world contexts.

Mousa Alalhareth: Software, Validation, Writing—review & editing, Supervision.