Abstract
Enabling personalized interactions and gaining deeper insights into user behavior have become increasingly important for improving human-computer interaction. However, the natural class imbalance in personality datasets, reflecting the real-world distribution of traits where certain personality types are less represented, poses a major challenge for classification models and often results in biased performance. This research explores a range of class imbalance mitigation techniques (CIMTs), including sampling methods and loss functions, to address this issue. We introduce Adaptive Focal Loss with Personality-Stratified Dataset Splitting, a novel approach specifically designed to mitigate class imbalance while stabilizing performance in multi-dimensional personality recognition. Additionally, we analyze multiple evaluation techniques, including regular accuracy, F1 score, and balanced accuracy, recommending the latter for a more comprehensive and fair performance analysis. Our experiments reveal that the proposed stratified techniques with label representation are vital for making performance insensitive to dataset splitting, while Adaptive Focal Loss significantly enhances classification performance on imbalanced datasets by incorporating a trainable hyperparameter, thereby also addressing the challenges of hyperparameter sensitivity and selection. On average across dimensions, our method improves balanced accuracy by up to 7% on the Kaggle dataset and 5% on the Essays dataset compared to regular training, while maintaining minimal computational overhead. These findings mark a critical step toward more robust and equitable personality recognition systems using relatively computationally efficient models. The related research resources, including the code repository and datasets, are available at: https://research.jingjietan.com/?q=AFLPS
Introduction
Personality recognition has significant potential across various fields, including psychology, social media analysis, and targeted advertising. With the rapid advancements in artificial intelligence and hardware capabilities, personality-aware functionalities in human-robot interaction have become increasingly valuable, particularly in contexts such as home-care robotics1. By understanding user personality, robots can deliver more personalized and effective assistance, such as adapting communication style to match user preferences, providing emotional support, and improving engagement. For example, a personality-aware home-care robot could adjust its interaction strategies to encourage physical activity for introverted users or offer more social stimulation for extroverted users. Such systems could also assist in early detection of psychological disorders, preventing potential crises by recognizing changes in behavioral patterns. Furthermore, personality recognition enables robots to tailor activity recommendations, optimize task scheduling, and create a more natural, user-centered experience, ultimately increasing trust and adoption of robotic systems in daily life.
However, existing personality recognition models often suffer from significant class imbalance, where certain personality traits are severely underrepresented, leading to biased predictions. A common approach to mitigate this issue is to scale up model size or expand training datasets, but these strategies drastically increase computational complexity and training time, making them impractical for many real-world applications2,3,4. Instead, we propose Adaptive Focal Loss, a novel method designed to address class imbalance while maintaining computational efficiency, offering a more scalable and effective solution compared to conventional approaches.
Literature review
Personality determinant
There are three ways to express a person’s personality: visual, audio, and textual modalities5. The major determinants of personality are biological factors and environmental influences6,7,8. These factors contribute to the uneven distribution of personality types9,10. Among these modalities, text-based personality detection is generally more accessible and practical compared to other approaches. This is primarily because it does not require specialized equipment for data collection. For instance, sample texts can be easily obtained from a user’s social media posts or conversations11. Moreover, the content of a message often reflects an individual’s thought processes and reveals their behavior12. Additionally, Yarkoni13 demonstrated a strong positive correlation between an author’s personality traits and their word usage. In the 21st century, people frequently leave behind a “personality footprint” through social media posts that express their feelings and emotions14,15. This creates a significant opportunity to extract personality information from such indirect background data, which is particularly beneficial for initializing personality-aware functionalities in new robots or applications16.
Personality theories
There are several theories to describe a person’s personality, each proposing a different number of dimensions17,18. These include the Big-3 model, represented by the Eysenck Personality Questionnaire (EPQ)19,20; the Big-4 model, exemplified by the Myers-Briggs Type Indicator (MBTI)6,21; the Big-5 model, which includes the dimensions of “Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism” (OCEAN)22; and the Big-6 model, represented by the HEXACO framework, which includes “Honesty-Humility, Emotionality, Extraversion, Agreeableness, Conscientiousness, and Openness”23. However, the relationships between personality traits across different theories exhibit only limited correlations24,25. This limitation restricts model conversion when a system26 needs to adopt a different personality framework. Nonetheless, for clarity and consistency, the dimensions in this paper are organized and aligned with the popular Big-5 personality model, as shown in Table 1.
Algorithms
In our study, we reviewed various methodologies for personality recognition, encompassing feature extraction techniques, classical machine learning algorithms, deep learning approaches, and strategies for addressing class imbalance.
Feature extraction techniques
Text normalization is a crucial preprocessing step that converts words into their fundamental forms to enhance tokenization performance. This is particularly important for English, which has an extensive vocabulary of approximately one million words, making direct tokenization impractical27,28. Two common normalization techniques are lemmatization and stemming. Lemmatization maps words to their base forms, or lemmas, while preserving their intended meanings. In contrast, stemming removes affixes to extract word stems without considering context29. Although lemmatization retains semantic meaning, it requires greater computational resources30,31. In summary, lemmatization preserves contextual meaning, whereas stemming is computationally more efficient. Following normalization, tokenization is a fundamental step in natural language processing (NLP) that segments text into smaller units, or tokens, which convey specific meanings within a given context. Tokenization can be classified into three types: word tokenization, sub-word tokenization, and character tokenization. Special tokens are typically included to denote sentence boundaries or serve padding functions32. Given the vast and evolving nature of the English lexicon, word tokenizers frequently encounter out-of-vocabulary (OOV) issues26. During inference, unknown words are assigned a designated token to indicate their presence.
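As a concrete illustration of this trade-off, the snippet below contrasts the two normalization strategies. It is a minimal sketch assuming the NLTK library (the paper does not name its normalization toolkit) with the WordNet corpus downloaded.

```python
# A minimal sketch contrasting lemmatization and stemming; assumes NLTK with
# the WordNet corpus available via nltk.download("wordnet").
from nltk.stem import WordNetLemmatizer, PorterStemmer

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

for word in ["studies", "running", "feet"]:
    # Lemmatization keeps a valid word ("studies" -> "study", "feet" -> "foot"),
    # while stemming may return a non-word stem ("studies" -> "studi").
    print(word, lemmatizer.lemmatize(word), stemmer.stem(word))
```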
Feature extraction in NLP involves deriving meaningful information from text data to support further analysis and modeling. This step is critical in machine learning, as it transforms raw text into a structured format suitable for pattern recognition and predictive modeling. Common feature extraction techniques include Bag of Words (BoW), word embeddings, and psycholinguistic features. The BoW approach, often combined with Term Frequency-Inverse Document Frequency (TF-IDF), constructs feature vectors by counting word occurrences within a document. Each vector element represents the frequency of a particular word in the document. Word embedding, a more advanced technique, maps words into a continuous vector space33, capturing semantic relationships. Methods such as Word2Vec and GloVe employ unsupervised learning to generate these vector representations by analyzing word co-occurrences within large corpora. On the other hand, psycholinguistic features extend beyond surface-level textual representations to incorporate semantic and psychological dimensions34,35. These features include metrics such as word frequency, concreteness, imageability, emotionality, and valence36. Integrating psycholinguistic features into NLP models enhances their ability to capture the psychological and emotional nuances of language, thereby improving performance in sentiment analysis, personality recognition, and related tasks37.
Machine learning models
Mairesse et al.38 conducted state-of-the-art research examining the psycholinguistic feature set. The research team studied the feature set’s performance using the Essays corpus, applying correlational analysis to the psycholinguistic features to identify those most significant for personality classification. Mairesse et al.38 reached several conclusions. Although extraversion is easily noticed in conversation, it is hardly reflected in written text, because humans may reconstruct their writing and lose their personality-driven behavior to a certain extent12. In addition, the Essays dataset was collected in a more formal ambiance, which inclines the written text toward a formal register that is negatively correlated with extraversion. These are the reasons why it is challenging to produce better results on the Essays dataset.
Majumder et al.39 proposed a Convolutional Neural Network (CNN)-based architecture for personality recognition using the Essays dataset. Their model achieved up to 62.68% accuracy in the Openness dimension, outperforming state-of-the-art approaches by integrating psycholinguistic features. Furthermore, Rahman et al.40 and Tinwala & Rauniyar41 re-implemented the same architecture proposed by Majumder et al.39 with variations in activation functions. Rahman et al.40 reported that the hyperbolic tangent function produced a better F1 score (up to 59.80% in Openness) compared to sigmoid and leaky ReLU. Additionally, Ontoum & Chan42 found that a Recurrent Neural Network (RNN) achieved an overall accuracy of 49.75%, surpassing Naïve Bayes and Support Vector Machines, which achieved accuracies of 41.03% and 41.97%, respectively. Furthermore, Amirhosseini and Kazemian introduced the use of Extreme Gradient Boosting (XGBoost), utilizing TF-IDF features for personality type prediction43. Moreover, Salminen et al.2 experimented with personality classification using bidirectional Long Short-Term Memory (bi-LSTM) combined with a word-to-vector model across three datasets. Their approach demonstrated an average improvement of 0.1 in F1 score across the five personality traits. Furthermore, Hernandez & Scott44 evaluated the performance difference between post-based classification and user-based classification (aggregating all posts from individual users in the dataset) using RNNs. They found that user-based classification improved accuracy by approximately 10%, highlighting the potential of longer text samples in revealing more comprehensive information about a user’s personality.
Moving on, Mehta et al.3 achieved new state-of-the-art performance in personality prediction by integrating BERT embeddings with psycholinguistic features. Their hybrid approach combines automated feature extraction with theoretical linguistic insights, addressing the limitations of models that rely solely on either deep learning or hand-crafted linguistic features. Using interpretable machine learning, they quantified the impact of language features, enhancing both predictive accuracy and model interpretability. Over time, personality prediction models have progressively improved by leveraging complex deep learning architectures, including knowledge graph attention neural networks45, to enhance machine-readable cognitive understanding of semantic relationships. While these advancements enhance predictive performance, they also introduce greater computational complexity. However, a persistent bottleneck remains: previous studies have not adequately addressed the issue of class imbalance, which limits the model’s generalizability and fairness.
Class imbalance mitigation techniques
There are various class imbalance mitigation techniques available. The simplest approach involves modifying the dataset through oversampling or undersampling. Oversampling duplicates minority class samples, while undersampling removes majority class samples to achieve a balanced distribution. A more sophisticated alternative is the Synthetic Minority Oversampling Technique (SMOTE), which interpolates between nearest-neighbor minority samples to enhance the effectiveness of oversampling. Cerkez & Vareskic4 studied CIMTs at the dataset level, applying undersampling, oversampling, and SMOTE with various classifiers (Support Vector Machine (SVM), Long Short-Term Memory (LSTM), etc.) to resolve the multiclass imbalance issue in the Kaggle dataset. They showed that oversampling is the best approach of all, producing an F1-score of 0.287 compared to only 0.144 for SMOTE. They also found that the undersampling technique is not suitable for this task, as it yielded a result of only 0.038, significantly lower than the other techniques.
On the other hand, Lin et al.46 introduced focal loss, a cost-sensitive learning technique designed to address class imbalance in multiclass object detection by emphasizing hard-to-classify examples. We observed that focal loss offers favorable computational efficiency and may be particularly useful in personality recognition tasks, where class distinctions are often subtle and ambiguous. By down-weighting well-classified instances and focusing on hard samples, focal loss provides a mechanism to improve fairness across dimensions by implicitly weighting minority labels. This potential has not yet been explored for multi-dimensional personality recognition. This gap presents an opportunity to adapt and extend focal loss as a core component of our CIMTs.
To the best of our knowledge, most existing research does not adequately address the class imbalance problem, particularly in studies that adopt the imbalanced Kaggle dataset. This work seeks to fill that gap by proposing more suitable evaluation metrics for assessing model performance, with a particular emphasis on CIMTs. Notably, prior research has predominantly relied on regular accuracy (RA) as the primary evaluation metric, often overlooking that measuring accuracy across both positive and negative labels offers a more comprehensive and reliable assessment.
Datasets processing
To the best of our knowledge, existing research3,4,43 has not employed suitable stratified algorithms (some not even applying label-stratified algorithms that split each class evenly) to ensure an equitable distribution of individuals when creating evaluation sets. Because the distribution of the other personality dimensions remains unmonitored, this increases model bias. This issue arises not only in imbalanced datasets but also in balanced ones if not properly addressed.
On the other hand, recent research tends to prepare a separate model for each personality dimension. However, this means that the number of models to retrain, along with the inference time and cost, grows with the number of dimensions. Hence, some research4 has started re-categorizing the dimensions into \(2^n\) distinct classes instead of n binary classifications. Building on Deimann’s findings47 that the combination of two different dimensions shows distinct trends in text generation, we conducted an experimental study on our proposed label representation strategy. This approach treats each dimension as a separate task, enabling shared perceptrons to facilitate knowledge exchange.
Contributions
The key contributions of this work are as follows:
- We propose new adaptive focal loss techniques to address imbalanced datasets in personality recognition by focusing on difficult examples.
- We introduce novel personality-stratified approaches for dataset distribution and propose a suitable label representation strategy.
- We suggest appropriate evaluation metrics with rigorous mathematical proofs and assessment.
Methods
As shown in Fig. 1, we visualize the proposed adaptive focal loss algorithm and its training pipeline, designed specifically for multi-dimensional personality recognition. In addition, we include different machine learning methods and features for comparison studies.
Illustration of the proposed Adaptive Focal Loss with Personality-Stratified Splitting for Hard Class Imbalance in Multi-Dimensional Personality Recognition. Following the proposed dataset stratification and splitting approach, the training process begins with feature processing, including proposed novel datasets processing techniques along with text tokenization, feature extraction (LIWC and TF-IDF), and tensor conversion. A batch balancing inspector assesses class distribution, and a weight adjuster computes instance-specific weights, which are later utilized alongside difficulty scaling in the Adaptive Focal Loss function. During backpropagation, binary cross-entropy (BCE) loss is computed, and an adaptive focal factor is applied to adjust gradient scaling dynamically.
Datasets processing
This study employed two publicly available datasets: the Essays dataset (2,479 entries, categorized into five classes based on the Big-5 model: ‘O’, ‘C’, ‘E’, ‘A’, ‘N’)48, and the Kaggle dataset (8,675 entries, categorized into four classes based on the Big-4 (MBTI) model: ‘O’, ‘C’, ‘E’, ‘A’)49. The label distributions for each dimension are illustrated in Fig. 2. To quantify the degree of class imbalance, we use the Gini index (G)50, defined in Eq. 1, where p denotes the proportion of each class. The Gini index ranges from 0 (complete purity, one class dominates) to 0.5 (maximum impurity, perfectly balanced classes). Thus, lower G values indicate higher imbalance and reduced class diversity, whereas higher values reflect a more even distribution. Based on this measure, the Essays dataset is relatively balanced (\(G=0.499\)), while the Kaggle dataset shows pronounced imbalance (\(G=0.392\)).
The overall distribution of positive labels for each dimension in the Essays and Kaggle datasets reveals notable differences in balance, with the respective Gini index shown in parentheses. Notably, the Kaggle dataset exhibits significant class imbalances, particularly in the ‘O’ and ‘E’ dimensions, with majority-minority label ratios of approximately 86:14 and 77:23, respectively, as reflected in their Gini indices of 0.238 and 0.355, indicating a skewed distribution that may impact model performance. In both datasets, the dimensions are ranked from most imbalanced to least imbalanced as follows: for the Essays dataset, the order is A > E > O > C > N, and for the Kaggle dataset, it is O > E > C > A.
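The reported values are consistent with the standard Gini impurity formulation, \(G = 1 - \sum _i p_i^2\); the short sketch below (our assumption of Eq. 1's exact form, which reproduces the reported figures) makes the measure concrete.

```python
# Sketch of the Gini index of Eq. 1, assuming the standard Gini impurity
# G = 1 - sum(p_i^2) over class proportions p_i (0 = pure, 0.5 = balanced).
def gini(proportions):
    return 1.0 - sum(p ** 2 for p in proportions)

print(round(gini([0.5, 0.5]), 3))      # 0.5   -> perfectly balanced
print(round(gini([0.862, 0.138]), 3))  # 0.238 -> Kaggle 'O' dimension (86:14)
```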
Fig. 3 illustrates the proposed personality-stratified splitting algorithm, which ensures a balanced distribution of personality labels across all dimensions in both training and evaluation sets. The dataset is partitioned according to distinct personality classes, allowing each split to include a diverse range of text samples from writers with varying trait combinations. For example, when introversion and extraversion are chosen as anchor dimensions, the distribution of the remaining traits is also balanced, ensuring the model encounters sufficient examples of individuals who are, for instance, extroverted but low in neuroticism as well as those high in neuroticism. Without such stratification, train-test splits may become unbalanced across these combinations, causing performance to fluctuate and making evaluations overly dependent on random data splits rather than model quality. Our method addresses this by providing a more stable and reliable foundation for model assessment and future research.
For each sample, each personality dimension takes a binary value, either ‘0’ or ‘1’, so the total number of personality type combinations is determined by the number of dimensions in the personality model, calculated as \(2^d\), where d denotes the number of personality dimensions. Using Fig. 3 as an example, the MBTI theory (4 dimensions) results in \(2^4 = 16\) personality types, whereas the Big Five personality model, comprising five dimensions, results in \(2^5 = 32\) distinct classes. We adopted Eq. 2 to represent the individual personality labels as a single integer value. For instance, consider a person with openness (\(O = 1\)), conscientiousness (\(C = 1\)), extraversion (\(E = 0\)), agreeableness (\(A = 0\)), and neuroticism (\(N = 0\)). The representation is calculated as: \(2^0 \cdot 1 + 2^1 \cdot 1 + 2^2 \cdot 0 + 2^3 \cdot 0 + 2^4 \cdot 0 = 3\). The overall dataset distribution for training and evaluation follows an 8:2 ratio, where the training set is further divided into training and validation subsets, resulting in an overall split of [8(8:2)]:2. Nonetheless, this representation is only used for data sampling and does not act as a multiclass training label. The ready datasets are available in51,52.
Visualization of the personality-stratified algorithm on label combinations (4 dimensions for MBTI theory, 16 distinct classes) in the personality model, with an equal distribution across training, validation, and evaluation sets. The distribution aims to maximize variation without anchoring any specific dimension. Darker colors represent 1, while lighter colors represent 0 for each respective dimension.
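To make the splitting procedure concrete, the sketch below encodes Eq. 2 and uses the encoded combination as the stratification key. The helper names are ours, and scikit-learn's stratified splitter stands in for the procedure of Fig. 3 (it requires at least two samples per combination).

```python
# Sketch of the personality-stratified split: Eq. 2 packs the d binary
# dimension labels into one integer (2^d combinations), used only as the
# stratification key, never as a training label. labels: (n_samples, d) 0/1.
import numpy as np
from sklearn.model_selection import train_test_split

def encode_personality(labels: np.ndarray) -> np.ndarray:
    weights = 2 ** np.arange(labels.shape[1])   # [1, 2, 4, ...]
    return labels @ weights                     # e.g. OCEAN 11000 -> 3

def personality_stratified_split(X, labels, seed=0):
    # Overall [8(8:2)]:2 split: 80/20 train/eval, then 80/20 train/validation.
    X_tr, X_ev, y_tr, y_ev = train_test_split(
        X, labels, test_size=0.2, stratify=encode_personality(labels),
        random_state=seed)
    X_tr, X_va, y_tr, y_va = train_test_split(
        X_tr, y_tr, test_size=0.2, stratify=encode_personality(y_tr),
        random_state=seed)
    return (X_tr, y_tr), (X_va, y_va), (X_ev, y_ev)
```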
Datasets cleaning
The processing begins with dataset preparation, followed by text normalization to remove extraneous information such as variations in case, symbols, and links present in the dataset. This step utilizes the regex library to filter out invalid characters. To remove hyperlinks, we use the regular expression pattern https?:\S+ and, similarly, non-alphanumeric symbols are filtered using [^0-9a-z].
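A minimal sketch of this cleaning step is given below; the original patterns were partially garbled in typesetting, so the hyperlink pattern shown is our reading of them.

```python
# Minimal cleaning sketch under our reading of the patterns above:
# strip hyperlinks, lowercase, then keep only alphanumerics and spaces.
import re

def clean(text: str) -> str:
    text = re.sub(r"https?:\S+", " ", text)    # remove hyperlinks
    text = text.lower()
    text = re.sub(r"[^0-9a-z ]", " ", text)    # drop non-alphanumeric symbols
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace

print(clean("Check https://example.com NOW!!!"))   # -> "check now"
```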
Text tokenization
Text tokenization is applied to segment documents into smaller units, such as words, to facilitate computational analysis. We utilize WordNet53, a lexical database that plays a crucial role in tokenization by providing structured relationships between words, including synonyms, antonyms, and hypernyms. This enhances natural language understanding and improves text processing54.
LIWC features
We employed Linguistic Inquiry and Word Count (LIWC) to extract psycholinguistic features from textual data. These predefined psychological and linguistic dimensions-such as emotions, cognitive processes, and social concerns-serve as features for analyzing individuals’ psychological and emotional states based on their linguistic choices48. In this experiment, we adopted 80 LIWC features55.
TF-IDF feature
We employ the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm to quantify the importance of words within a document relative to a collection of documents (corpus). TF-IDF is computed as shown in Eq. 3, where \(\text {TF}(t,d)\) represents the term frequency of term t in document d, and \(\text {IDF}(t,D)\) is the inverse document frequency, defined in Eq. 4. Here, |D| denotes the total number of documents in the dataset, while \(|{d \in D : t \in d}|\) represents the number of documents containing term t. To enhance computational efficiency and reduce noise, we limit the TF-IDF feature set to the top 5000 terms with the highest variance across the dataset. This selection ensures that only the most informative terms contribute to the downstream analysis.
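A short sketch of this stage follows. Note that the paper keeps the 5000 highest-variance terms, whereas scikit-learn's `max_features` ranks terms by corpus frequency; the built-in cap shown here is therefore only an approximation of that selection step.

```python
# Sketch of the TF-IDF stage (Eqs. 3-4) with a 5000-term cap. A faithful
# reproduction of the variance-based selection would filter columns by
# variance explicitly after fitting.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["i love hiking with friends", "i prefer quiet evenings alone"]
vectorizer = TfidfVectorizer(max_features=5000)
tfidf = vectorizer.fit_transform(corpus)   # sparse (n_documents, vocabulary)
print(tfidf.shape)
```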
Convert tensor
Now that each sample in the dataset is prepared, we convert the features into tensors. Since both the LIWC features (80-length vector) and TF-IDF features (5000-length vector) consist of floating-point values, we store them as tensors to enable efficient batch processing using CUDA. For experiments involving both feature sets, the two tensors are concatenated, resulting in a final feature vector of length 5080.
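A brief sketch of this conversion, with placeholder feature batches standing in for the real extracted features:

```python
# Sketch of the tensor-conversion step: LIWC (80-dim) and TF-IDF (5000-dim)
# float vectors are concatenated into a single 5080-dim feature tensor.
import torch

liwc = torch.rand(32, 80)       # placeholder batch of LIWC features
tfidf = torch.rand(32, 5000)    # placeholder batch of TF-IDF features
features = torch.cat([liwc, tfidf], dim=1)
print(features.shape)           # torch.Size([32, 5080])
if torch.cuda.is_available():   # enables efficient CUDA batch processing
    features = features.cuda()
```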
Class imbalance mitigation techniques
Oversampling (OS)
Oversampling is a statistical-based class imbalance mitigation technique (CIMT) that helps balance the representation of features, ensuring that statistical findings do not disproportionately favor the majority class56. The oversampled dataset with class stratification can be obtained using Eq. 5, where D represents the original dataset, \(D'\) denotes the oversampled dataset, and \(\alpha\) denotes the ratio of the majority class to the minority class.
Synthetic Minority Oversampling Technique (SMOTE)
Unlike simple oversampling, SMOTE is designed to generate synthetic samples by interpolating between existing minority class samples, which effectively avoids the overfitting bias of classical machine learning models57. The generation of synthetic samples in SMOTE can be formalized as follows: for a given minority class sample \(x_i\), a new synthetic sample \(x_{\text {new}}\) is generated by selecting one of its k nearest neighbors \(x_{\text {k}}\) and applying the interpolation rule in Eq. 6, where \(\lambda\) is a random value drawn from a uniform distribution between 0 and 1, ensuring that the synthetic sample lies on the line segment between \(x_i\) and \(x_{\text {k}}\). This approach encourages the generation of diverse synthetic samples while preserving the local structure of the minority class.
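The interpolation rule itself is compact; the sketch below illustrates Eq. 6 directly (in practice one would use a library implementation such as imbalanced-learn's SMOTE, which also handles the neighbor search).

```python
# NumPy sketch of the SMOTE interpolation rule (Eq. 6):
# x_new = x_i + lambda * (x_k - x_i), with lambda ~ U(0, 1).
import numpy as np

rng = np.random.default_rng(0)

def smote_sample(x_i: np.ndarray, x_k: np.ndarray) -> np.ndarray:
    lam = rng.uniform(0.0, 1.0)        # random point on the segment x_i..x_k
    return x_i + lam * (x_k - x_i)

x_i = np.array([1.0, 2.0])             # a minority class sample
x_k = np.array([3.0, 4.0])             # one of its k nearest minority neighbors
print(smote_sample(x_i, x_k))
```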
Weighted Loss (WL)
Alternatively, the loss weight can be adjusted for specific labels. Since the classification task involves binary labels (true or false) for each dimension, Binary Cross-Entropy (\(\mathscr {L}_{\text {BCE}}\)) is employed, as expressed in Eq. 7, where \(\hat{y}\) represents the predicted label and y the actual label.
The loss can be weighted. As shown in Eq. 8, Balanced Binary Cross-Entropy (\(\mathscr {L}_{\text {BBCE}}\)) adjusts the weight according to the positive labels. This rescaling can be applied either sample-wise or batch-wise by adjusting the weight of positive (1) labels via \(\alpha\). The scaling factor, controlled by \(\alpha _p\), determines whether to increase or decrease the contribution of positive labels. Here, \(\mathscr {L}_n\) and \(\mathscr {L}_p\) represent the losses over the negative and positive labels, respectively. This ensures that positive and negative examples contribute equally to the overall loss. When no negative labels are present in a batch (\(\mathscr {L}_n = 0\)), the scale is set to 0 to avoid division by zero. Experiments also explored setting the scale to 1 in such cases, but this showed no significant impact on accuracy.
Moving on, the training can also adapt the loss weight according to the class distribution, namely Weighted Binary Cross-Entropy, \(\mathscr {L}_{\text {WBCE}}\), as in Eq. 9. In contrast to the scaling of \(\mathscr {L}_{\text {BBCE}}\), \(\mathscr {L}_{\text {WBCE}}\) assigns an inverse-frequency weight to each label. Hence, the minority label is recalibrated to a relatively higher weight compared to the majority label to increase its contribution to the overall loss function. This can likewise be applied batch-wise or sample-wise. The weight is enforced through \(\alpha _y\) for \(y \in \left\{ 0, 1\right\}\), calculated using \(f_y\), the frequency of label y, and n, the total number of samples.
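A batch-wise sketch of this inverse-frequency weighting follows; the exact normalization of Eq. 9 is our interpretation, and PyTorch's element-wise `weight` argument carries the per-label factors.

```python
# Sketch of batch-wise inverse-frequency weighted BCE in the spirit of Eq. 9:
# each label's weight is n / (2 * f_y), so minority labels contribute more.
import torch
import torch.nn.functional as F

def weighted_bce(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    n = targets.numel()
    f_pos = targets.sum().clamp(min=1.0)         # frequency of positive labels
    f_neg = (n - targets.sum()).clamp(min=1.0)   # frequency of negative labels
    weights = torch.where(targets == 1, n / (2 * f_pos), n / (2 * f_neg))
    return F.binary_cross_entropy_with_logits(logits, targets, weight=weights)
```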
Focal Loss (FL)
With reference to46, we introduce a focal loss based algorithm specified for personality recognition. Focal Binary Cross-Entropy, \(\mathscr {L}_{\text {FBCE-F}}\), is formulated in Eq. 10, where \(\mathscr {L}_{\text {BCE}}\) is the scalar output from binary cross-entropy, while \(\alpha \in \left( 0,\infty \right)\) and \(\gamma \in \left( 0,\infty \right)\) are hyperparameters: the former handles the class imbalance issue (as in weighted BCE loss), and the latter (together with the second term in the equation) adjusts the rate at which easy samples are down-weighted. In particular, the parameter \(\alpha\) serves as a scaling factor for the final loss. Since \(-\mathscr {L}_{\text {BCE}}\) always returns a non-zero negative value (\(\mathbb {R}^-\)), the transformation \(e^{\mathbb {R}^-}\) ensures the range is confined to (0, 1), which is inversely correlated with \(\hat{y}\). We adapted the Binary Cross-Entropy (BCE) loss by replacing the logarithm from the original formulation, leveraging the PyTorch library to effectively handle cases of zero error.
In this context, we define “hard” samples as those with low prediction confidence (i.e., higher \(\mathscr {L}_{\text {BCE}}\) values), while “easy” samples correspond to high-confidence predictions (i.e., lower \(\mathscr {L}_{\text {BCE}}\) values). This relationship is illustrated in Fig. 4. The proposed loss function \(\mathscr {L}_{\text {FBCE}}\) is positively correlated with the standard BCE loss. The hyperparameter \(\gamma\) plays a crucial role in determining the threshold for distinguishing between easy and hard samples. It specifically influences the second term of the loss function, which subsequently impacts the overall loss value. Through this mechanism, higher values of \(\mathscr {L}_{\text {BCE}}\) increase the focal loss output, while lower values of \(\mathscr {L}_{\text {BCE}}\) decrease it. This ensures that easier samples have reduced impact during backpropagation. Hence, \(\alpha\) is critical for magnifying or weighting the loss for individual samples during model training. It can be set to a constant value or scaled dynamically. Notably, when \(\alpha =1\) and \(\gamma =0\), Eq. 10 reduces to the standard \(\mathscr {L}_{\text {BCE}}\).
Illustration of \(\mathscr {L}_{\text {FBCE}}\) versus \(\mathscr {L}_{\text {BCE}}\) (using \(\alpha =1\) and \(\gamma =2\)). When \(\mathscr {L}_{\text {BCE}}\) loss is small (0.5), the focal loss is significantly smaller (0.2052). Conversely, when \(\mathscr {L}_{\text {BCE}}\) loss is higher (0.75), the focal loss also becomes higher (0.9448).
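The base formulation translates directly into code; the sketch below implements Eq. 10 per element, so easy (low-loss) samples are down-weighted before reduction.

```python
# Sketch of the base focal BCE (Eq. 10):
# FBCE = alpha * (1 - e^{-BCE})^gamma * BCE, computed per element.
import torch
import torch.nn.functional as F

def focal_bce(logits, targets, alpha=1.0, gamma=2.0):
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    focal_factor = (1.0 - torch.exp(-bce)) ** gamma   # ~0 for easy samples
    return (alpha * focal_factor * bce).mean()

# Sanity check: with alpha = 1 and gamma = 0 this reduces to plain BCE.
```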
In this work, we propose three variants, which can also serve as an ablation study.
1. Weighted Focal BCE Loss, \(\mathscr {L}_{\text {FBCE-W}}\): In this variant, \(\alpha\) is replaced by a weighting factor, w, according to the label. The loss is defined as in Eq. 11, where w is formulated by the \(\alpha\) of WBCE (Eq. 9).

$$\begin{aligned} \mathscr {L}_{\text {FBCE-W}}\left( \hat{y}, y\right) = w \cdot \left( 1 - e^{-\mathscr {L}_{\text {BCE}}\left( \hat{y}, y\right) }\right) ^\gamma \cdot \mathscr {L}_{\text {BCE}}\left( \hat{y}, y\right) \end{aligned}$$ (11)

2. Trainable Focal BCE Loss, \(\mathscr {L}_{\text {FBCE-T}}\): As shown in Eq. 12, this loss extends \(\mathscr {L}_{\text {FBCE-W}}\) by making \(\gamma\) a trainable parameter rather than a fixed constant, with an additional regularization term to prevent extreme updates58. This design removes the need for manually tuning \(\gamma\), enabling the model to dynamically adjust the relative importance of easy and hard examples during training (a sketch of this variant follows the list). By introducing a learnable difficulty-scaling parameter, the method enhances robustness and reduces sensitivity to hyperparameter selection. Consequently, the model can optimize gradient updates based on instance difficulty, leading to improved classification performance while mitigating class imbalance issues. The derivative with respect to \(\gamma\) is given in Eq. 13.

$$\begin{aligned} \mathscr {L}_{\text {FBCE-T}}\left( \hat{y}, y\right) = w\cdot \left( 1 - e^{-\mathscr {L}_{\text {BCE}}\left( \hat{y}, y\right) }\right) ^{\gamma }\cdot \mathscr {L}_{\text {BCE}}\left( \hat{y}, y\right) + \log (\gamma ) \end{aligned}$$ (12)

$$\begin{aligned} \frac{\partial \mathscr {L}_{\text {FBCE-T}}}{\partial \gamma } = w \cdot \left( 1 - e^{-\mathscr {L}_{\text {BCE}}\left( \hat{y}, y\right) }\right) ^{\gamma } \cdot \ln \left( 1 - e^{-\mathscr {L}_{\text {BCE}}\left( \hat{y}, y\right) }\right) \cdot \mathscr {L}_{\text {BCE}}\left( \hat{y}, y\right) + \frac{1}{\gamma } \end{aligned}$$ (13)

3. Multi-Dimensional Focal BCE Loss, \(\mathscr {L}_{\text {FBCE-M}}\): To avoid repeatedly training the MLP for each personality dimension (d) independently, we introduce a multi-dimensional focal cross-entropy loss, \(\mathscr {L}_{\text {FBCE-M}}\). Built on top of the aforementioned \(\mathscr {L}_{\text {FBCE-T}}\), this formulation enables automatic hyperparameter selection for \(\gamma\), producing a unified value that generalizes across all personality dimensions rather than requiring manual tuning for each one. This approach allows simultaneous optimization over all personality dimensions while leveraging shared representations to improve learning efficiency. The formal definition of \(\mathscr {L}_{\text {FBCE-M}}\) is given in Eq. 14.

$$\begin{aligned} \mathscr {L}_{\text {FBCE-M}} = \sum _{i=1}^d \mathscr {L}_{\text {FBCE-T}} \end{aligned}$$ (14)
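The sketch below implements the trainable variant (Eqs. 11-12) in PyTorch; details such as the \(\gamma\) initialization, its positivity parameterization, and the point at which the \(\log (\gamma )\) regularizer is added relative to the batch reduction are our assumptions.

```python
# Sketch of the trainable focal BCE (Eqs. 11-12): gamma is learnable,
# regularized by log(gamma); w follows the inverse-frequency weighting of Eq. 9.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrainableFocalBCE(nn.Module):
    def __init__(self, gamma_init: float = 1.5):
        super().__init__()
        # Log-parameterization keeps gamma > 0 throughout optimization.
        self.log_gamma = nn.Parameter(torch.tensor(float(gamma_init)).log())

    def forward(self, logits, targets, weights):
        gamma = self.log_gamma.exp()
        bce = F.binary_cross_entropy_with_logits(logits, targets,
                                                 reduction="none")
        focal = weights * (1.0 - torch.exp(-bce)) ** gamma * bce
        return focal.mean() + torch.log(gamma)   # log(gamma) regularizer

# Multi-dimensional form (Eq. 14): one shared network, loss summed over all
# d heads, e.g. sum(criterion(logits[:, i], y[:, i], w[:, i]) for i in range(d)).
```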
In summary, using the aforementioned CIMTs, a multilayer perceptron (MLP) network is designed and trained. From empirical experiments, the optimized network consists of 3 layers, with each layer containing up to 25 neurons to avoid over-fitting. The Adam optimizer is utilized with a learning rate of 0.0005, and early stopping is applied to terminate training once the model converges. For consistency and fairness in comparison, the same model architecture is employed across all experiments, with only the hyperparameters associated with the CIMTs being varied. The training process is detailed in Algorithm 1.
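For orientation, a skeleton of this setup is sketched below; the layer widths follow the (5, 5, 1) reading from the complexity analysis later in the paper and are otherwise an assumption, as is the single output head.

```python
# Sketch of the shared classifier under the stated training setup (three
# layers, Adam, lr = 0.0005, early stopping on the validation loss).
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(5080, 5), nn.ReLU(),   # 5080 = 80 LIWC + 5000 TF-IDF features
    nn.Linear(5, 5), nn.ReLU(),
    nn.Linear(5, 1),                 # one logit per personality dimension
)
optimizer = torch.optim.Adam(model.parameters(), lr=0.0005)
```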
Evaluation techniques
First, we conducted an empirical experiment to evaluate the stability of the proposed dataset splitting. The standard deviation, \(\sigma\), of the results was measured by Eq. 15, where \({BA}_i\) represents the Balanced Accuracy (Eq. 21) in the \(i^{th}\) run, \(\bar{BA}\) is the mean of the performance metric across all runs, and k is the number of experiments.
Next, to evaluate model performance, regular accuracy (RA), as defined in Eq. 16, is commonly adopted, where TP denotes true positives, FN false negatives, TN true negatives, and FP false positives. However, RA alone is not a reliable metric when the test data are imbalanced. For instance, if the number of positive samples (P) is larger than the number of negative samples (N), as in the case of the ‘O’ dimension in the Kaggle dataset where \(4N \approx P\), the equation can be reformulated as shown in Eq. 17. Notably, in the second term (which evaluates the performance of negative labels), the denominator is scaled by a factor of 5 when evaluating the performance of the negative class, which effectively dilutes its weight. As a result, the model is not incentivized to correctly predict TN, even though both positive and negative labels are equally important in personality recognition. Consequently, when all minority class (negative) samples are misclassified while all positive samples are correctly classified, as in Eq. 18, RA still reports a deceptively high value (\(\approx 0.80\)), failing to reflect the true performance of the model.
We do not encourage the use of the F1-score (Eq. 19) because it can provide misleadingly high values in class-imbalanced settings. Specifically, the F1-score does not decrease rapidly when the number of negative labels is relatively small (as in the RA case), and it tends to emphasize precision and recall without properly reflecting the imbalance between positive and negative classes. In fact, F1 is only reliable when the number of negative instances significantly exceeds the number of positive instances (i.e., when TP and FN dominate FP). Under the given distribution (\(4N \approx P\)), taking the extreme case where false positives approach the number of negatives and true positives approach the number of positives, as shown in Eq. 20, the F1-score remains deceptively high (\(\approx 0.89\)), which overstates model performance despite heavy misclassification.
Hence, balanced accuracy (BA), as formulated in Eq. 21, is introduced to provide a more reliable assessment that is not biased by the proportion of positive or negative samples in the test set. Specifically, the accuracy is calculated separately over the actual positives \((TP + FN)\) and the actual negatives \((TN + FP)\). This emphasizes the importance of considering the number of TN when evaluating model performance: each FP now counts directly against the negative class, rather than being diluted by the much larger positive class. Consequently, as shown in Eq. 22, under the same label distribution configuration (\(4N \approx P\)), poor classification on the negative class (despite extremely high accuracy on the positive class) still results in a BA score of only 0.5. Given that this research has important implications for human-robot interaction, particularly in generating specialized feedback and enabling the early detection of mood or psychological disorders, BA is therefore the most appropriate metric to evaluate the classification model, as both classes are equally important in these scenarios.
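The following worked check reproduces the deceptive-metric argument numerically under the \(4N \approx P\) distribution, with all positives classified correctly and all negatives misclassified.

```python
# Worked check of Eqs. 16-22: 4N ~ P, all positives right, all negatives wrong.
TP, FN, TN, FP = 400, 0, 0, 100

ra = (TP + TN) / (TP + TN + FP + FN)           # 0.80, deceptively high (Eq. 18)
f1 = 2 * TP / (2 * TP + FP + FN)               # ~0.89, deceptively high (Eq. 20)
ba = 0.5 * (TP / (TP + FN) + TN / (TN + FP))   # 0.50, exposes the failure (Eq. 22)
print(ra, round(f1, 2), ba)
```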
Furthermore, since there are several dimensions, the average accuracy difference between BA and RA across dimensions, denoted AD, is applied to study the performance of the respective CIMT on the model. In general, AD reflects the overall bias of the CIMT for a given machine learning model or input feature. AD is expressed in Eq. 23, where d is the number of dimensions. When \(BA-RA>0\), the model is biased toward the negative samples, and when \(BA-RA<0\), the model is biased toward the positive samples. Hence, we take the absolute value of \(BA-RA\) for summation, so the lower the value, the lower the bias of the model. In addition, this research adopts BA as the main evaluation metric of the algorithm, while RA is only used for secondary reference.
We also include a discussion of the epochs after applying CIMT. To this end, we examine the Average Epoch (AE) across all dimensions in each CIMT, computed using Eq. 24, where d denotes the number of dimensions.
Finally, to validate the proposed framework, we employ McNemar’s test for statistical hypothesis testing. As in Eq. 25, statistical significance is assessed by calculating the p-value using the cumulative distribution function (CDF) of the chi-squared distribution with one degree of freedom. Here, b is the number of instances correctly classified by the baseline but misclassified by the proposed approach, while c is the number misclassified by the baseline but correctly classified by the proposed approach. The resulting p-value is compared against the conventional significance level \(\alpha = 0.05\); if \(p < 0.05\), the null hypothesis is rejected, indicating that the proposed approach yields a statistically significant improvement in model performance.
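A brief sketch of the test follows; we assume the uncorrected statistic \((b-c)^2/(b+c)\) for Eq. 25 (a continuity-corrected variant also exists), and the counts shown are purely illustrative.

```python
# Sketch of McNemar's test (Eq. 25), assuming the uncorrected statistic
# (b - c)^2 / (b + c) referred to a chi-squared(1) distribution.
from scipy.stats import chi2

def mcnemar_p(b: int, c: int) -> float:
    stat = (b - c) ** 2 / (b + c)
    return 1.0 - chi2.cdf(stat, df=1)   # p-value from the chi2(1) CDF

print(mcnemar_p(b=40, c=90))            # illustrative counts; p < 0.05
```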
Results
In this section, we report our experimental results in three parts: Dataset Stratification, showing stable evaluation across random seeds; CIMT Performance, evaluating the effectiveness of our proposed approaches using comparable implementations; and Benchmarks, highlighting how our methods compare with existing state-of-the-art results.
Dataset stratification
To assess model stability, we followed the proposed dataset splitting method and utilized the simple Multi-Layer Perceptron (MLP) algorithm with 10 different random seeds. Fig. 5 shows the performance plot and its standard deviation for each BA across 10 random seed variations in the Essays and Kaggle datasets, respectively, after implementing the proposed personality-stratified split.
According to the visualized stability variation with standard deviation values in Fig. 5, the label-stratified split (where the data is split solely based on its own dimension) exhibits a wider range, indicating unstable performance, approximately 3% in the Essays dataset and 2% in the Kaggle dataset, whereas the proposed split maintains a more stable performance with only a 1% variation. This is more obvious in the more balanced Essays dataset, as there is no other factor to influence it. Moreover, we observe that the performance may drop below coin-flip probability (\(<0.5\)) when the sample lacks sufficient variety for the model to learn how a particular personality dimension behaves when combined with another personality dimension. If this combination is not present in the training sample, the model’s performance deteriorates further. As observed, using a personality-stratified split demonstrates performance improvement on BA, with an average improvement of around 1% in the Essays dataset and 2% in the Kaggle dataset. This shows that the proposed split provides more variety for the model to learn from, ensuring that the test set is not biased toward specific combinations, resulting in a more accurate test outcome. This further emphasizes the importance of using BA compared to F1 and RA, especially in imbalanced datasets, where BA effectively highlights minor biases that occur in the minority class of an imbalanced dataset.
On the other hand, we also validated the hypothesis that treating the problem as a multi-dimensional task (MLP (md)) or as multiple single tasks (MLP) yields better results than treating it as a multiclass (MLP (mc)) problem. As shown in Table 2 and Table 3, this is evident from the average balanced accuracy (BA) score, which indicates that the MLP (mc) model fails to learn effectively on the Essays dataset, while the MLP (md) model shows an improvement of more than 7% on the Kaggle dataset. This supports the hypothesis that merging these tasks would fail to accurately capture the unique characteristics and requirements of each individual dimension, as the model would not be explicitly guided to learn their distinct differences. When combining the confidence levels of each dimension into a multiclass prediction, the resulting confidence distribution becomes less interpretable and more ambiguous. This lack of clear differentiation could lead to a more complex and unreliable model output, ultimately compromising performance. In other words, predicting all dimensions correctly simultaneously in a 16-class setup is inherently more challenging, which would likely reduce the overall prediction accuracy.
Comparison of BA across personality dimensions using label-stratified and personality-stratified splits in (a) Essays and (b) Kaggle dataset. The box plots (using whisker = 1.5) illustrate performance variation across 10 random seeds, with average BA (Eq. 21) and standard deviations (Eq. 15) indicated above each plot.
CIMT performance
By adopting the aforementioned evaluation techniques, the model performance on the Essays dataset and the Kaggle dataset is presented in Table 2 and Table 3, respectively. Next, we focus on the performance of different techniques and our proposed approach. Since the Essays dataset is considered balanced (with the label distribution approximately 0.50), the AD is significantly lower (\(<1\%\)) compared to the Kaggle dataset. This difference is particularly pronounced in the Kaggle dataset for the ‘O’ and ‘E’ traits, which have highly imbalanced class distributions. This highlights the model’s bias toward the majority class, as imbalanced datasets often lead to poor performance in classical classifiers. Moving forward, this research further experimented with the LIWC features and compared their performance on oversampled and normal datasets. The observation was that LIWC features contributed to relatively poor performance, which can be attributed to the psychological features alone being insufficient for accurate classification. Although some improvement was observed in certain dimensions, such as the ‘O’ dimension in the Kaggle dataset, there was a performance decrement in others, such as the ‘E’ dimension in the Essays dataset. Therefore, subsequent experiments focus solely on the TF-IDF features to ensure a fair comparison.
We oversampled the training set while keeping the test set unchanged and used the same hyperparameters for the classifier to ensure a fair comparison. Overall, we observed a significant decrease in the AD value on the Kaggle dataset, suggesting that oversampling effectively reduces model bias. The BA result for the DT classifier is relatively low on both the normal and oversampled datasets, whereas Adaboost applies an ensemble method that combines several classifiers into a final classifier by assigning weights, making it effective in handling complex training challenges, including class imbalance. However, although Adaboost is less prone to imbalanced datasets, it shows a relatively lower AD in both the Essays and Kaggle datasets. Its performance can still be improved using oversampling, as the AD value improves from 0.001 to 0.000 in the Essays dataset. Nevertheless, Adaboost does not produce a better BA result compared to SVM, LR, and RF after boosting with CIMT. Additionally, SVM and LR consistently outperform other classifiers in the Essays and Kaggle datasets. Although both are prone to imbalanced datasets, the results indicate that CIMTs can significantly reduce the class imbalance effect.

For the MLP model using TF-IDF features, the findings are consistent with those of classical machine learning models, where CIMTs generally produce better results. However, the AD for the oversampled Essays dataset decreases, indicating that oversampling can amplify data noise and does not necessarily lead to improved outcomes. Nonetheless, both the BA and AD improve in the Kaggle dataset, and the AE also decreases. This demonstrates that in highly imbalanced scenarios, CIMTs have the potential to help the model identify important features and learn effectively.

Moving on, SMOTE demonstrated an improvement in performance. However, the improvement in BA was limited compared to the oversampling technique. Notably, SMOTE did not mitigate bias; instead, it introduced additional bias due to the noise generated during the synthetic data generation process. This is evidenced by an increase in AD compared to the oversampling approach. These findings are supported by59, which suggests that SMOTE does not always perform effectively, as it can sometimes amplify noise, potentially leading to degraded and unstable performance on imbalanced datasets.

Furthermore, the Weighted Loss (\(\mathscr {L}_{\text {BBCE}}\) and \(\mathscr {L}_{\text {WBCE}}\)) techniques successfully improved the performance, particularly in handling the significant class imbalance observed in the ‘O’ and ‘E’ classes of the Kaggle dataset. Additionally, compared to \(\mathscr {L}_{\text {BCE}}\), both \(\mathscr {L}_{\text {BBCE}}\) and \(\mathscr {L}_{\text {WBCE}}\) reduced the AE required for convergence, indicating that these methods facilitate more efficient learning. However, there is no significant performance difference between batch-level and sample-level implementations of \(\mathscr {L}_{\text {BBCE}}\) and \(\mathscr {L}_{\text {WBCE}}\). On the other hand, the AE increased compared to the normal and oversampled Essays datasets, which contain balanced labels. This suggests a limitation in relying solely on weighting toward the minority class, as it may inadvertently amplify noise in the data.
In general, the proposed Focal Loss approach (particularly the variants carrying a weighting factor: \(\mathscr {L}_{\text {FBCE-W}}\), \(\mathscr {L}_{\text {FBCE-T}}\), and \(\mathscr {L}_{\text {FBCE-M}}\)) significantly improved performance and robustness in BA, while reducing RA (avoiding model bias), compared with all other CIMTs. In the Kaggle dataset, the average BA increased notably from 0.7175 (without CIMTs) and 0.7582 (with \(\mathscr {L}_{\text {WBCE-S}}\)) to 0.7663 (with \(\mathscr {L}_{\text {FBCE-W}}\)). Moreover, the AD improved significantly, from 0.0871 (without CIMTs) to 0.0135 with \(\mathscr {L}_{\text {FBCE-W}}\). These results highlight that focal loss functions enable the model to effectively discriminate between easy and hard samples, allowing it to concentrate more on challenging data during training. In addition, the decline in RA from \(\mathscr {L}_{\text {BCE}}\) to FL observed for the ‘O’ dimension in the Kaggle dataset suggests again that RA may not be an appropriate metric for assessing model performance in imbalanced situations. On the other hand, although the Essays dataset is relatively balanced, focal loss still led to notable performance improvements. The BA increased from 0.5378 (with \(\mathscr {L}_{\text {BCE}}\)) to 0.5876 (with \(\mathscr {L}_{\text {FBCE}}\)) and further to 0.5905 (with \(\mathscr {L}_{\text {FBCE-W}}\)). This demonstrates the ability of focal loss to identify and emphasize difficult samples during training, reducing the amplification of noise in the data.
Benchmarks
The body of research exploring the use of CIMTs in personality recognition remains relatively limited. Thus, our study seeks to address this gap by incorporating several contemporary models for comparative analysis. While we reference a state-of-the-art (SOTA) model, our primary objective is to introduce a novel approach to enhance model performance, which is crucial for practical applications. Since we observe that accuracy is sensitive to dataset splitting due to dataset size, and to facilitate a fair comparison, we re-implemented the \(\mathscr {L}_{\text {BCE}}\) approach. It is important to note that we have presented our results in both RA and BA, as prior studies did not report findings in BA. The benchmark results are presented in Table 4 and Table 5, with statistical significance evaluated using the McNemar test in Fig. 6. Overall, our proposed technique exhibits superior performance compared to other CIMT approaches and significantly enhances the state-of-the-art model proposed by prior researchers. On average, there are substantial improvements (\(\mathscr {L}_{\text {BCE}}\) vs \(\mathscr {L}_{\text {FBCE-W}}\)) of more than 7% for imbalanced datasets and 5% for balanced datasets, showing again that the contribution of FL is not limited to class imbalance cases.
Although multiclass classification is not a commonly used approach for personality recognition, we included a study4 that utilized this method since it was the only one specifically focused on CIMT for personality recognition. In this research, the reported F1-score for multiclass classification (F1(mc)) is included alongside the binary classification F1-score (F1). To simulate multiclass classification, F1(mc) is calculated by multiplying the F1 score from all dimensions. The result shows that our F1 score improved by up to 11% compared to theirs, which, as discussed in our aforementioned empirical study, further supports our hypothesis that a multidimensional approach is more suitable than multiclass in personality recognition. Therefore, we again do not advocate for the use of multiclass classification for the aforementioned reasons.
The Sankey diagram shows the transition from the regular (\(\mathscr {L}_{\text {BCE}}\)) method (R) to the proposed AFL (\(\mathscr {L}_{\text {FBCE-M}}\)) method (P), evaluated against the original ground truth (G). To assess statistical significance, we use the McNemar p-value test. Since personality spans multiple dimensions, we flatten and concatenate them to simplify the visualization. The diagram shows that a substantial number of errors are corrected, with relatively few new mistakes introduced, highlighting the effectiveness of the proposed approach.
Discussion
To examine the contribution of each term in the focal loss and the effect of hyperparameter settings, we conducted an ablation study, as shown in Fig. 7. Specifically, we analyzed the impact through FBCE-F and FBCE-W on both the Essays and Kaggle datasets, using the average accuracy across all dimensions as the evaluation metric. There are two factors that rescale the loss: (i) the focal factor and (ii) the weight distribution factor. When \(\gamma =0\), the function loses the focal factor and collapses to \(\mathscr {L}_{\text {BCE}}\) (for FBCE-F) and \(\mathscr {L}_{\text {WBCE-S}}\) (for FBCE-W). Through this analysis, we found no clear correlation between the \(\gamma\) parameter and model performance or training time in either dataset. Nonetheless, these results remain valuable for comparative studies to understand its behavior.
The trend of the FBCE loss hyperparameter results in average accuracy (both BA and RA) and epochs (AE) across dimensions in the Essays and Kaggle datasets. The dotted lines show the baseline performance (BA, RA, and AE): regular BCE for 7(a) and 7(b), and WBCE-S for 7(c) and 7(d). In 7(a), the FBCE-F focal factor shows an improvement in accuracy (RA and BA) regardless of the values chosen in a balanced dataset. In 7(b), the FBCE-F focal factor shows sensitivity to the hyperparameter chosen in an imbalanced dataset. In 7(c), the FBCE-W focal factor demonstrates an improvement in accuracy (RA and BA), regardless of the values chosen for the weighted factor. In 7(d), the FBCE-W focal factor shows that the model is sensitive to \(\gamma\) in an imbalanced dataset; a well-chosen hyperparameter is vital for optimum performance.
As shown in Fig. 7(a) and (c) (balanced Essays dataset), both the BA and RA performance make it evident that the focal factor effectively singles out difficult samples in a balanced dataset, enabling them to contribute more significantly to the cost function. The performance consistently surpasses the baseline, irrespective of the chosen \(\gamma\) value. This improvement arises from the focal factor’s effectiveness in balanced datasets, where it selectively emphasizes challenging samples, enhancing the overall learning process. However, it is noteworthy that the AE required for convergence does not decrease when using the focal factor, as it necessitates an additional step to adequately learn from the hard samples. On the other hand, as in Fig. 7(b) and (d) (imbalanced Kaggle dataset), the BA continued to improve when the focal factor was applied in conjunction with the class distribution factor, effectively making the weight of the loss from each label more balanced. The RA and BA results further demonstrate the focal factor’s ability to reduce model biases. Moreover, the AE consistently decreased after incorporating the focal factor. This improvement is attributed to the focal factor’s capability to focus the model’s attention on difficult samples, facilitating more rapid convergence. However, when compared to the baseline (\(\mathscr {L}_{\text {FBCE-W}}\)), the imbalanced Kaggle dataset shows a performance improvement only when an appropriate \(\gamma\) value is chosen. This is due to the class imbalance, which makes it difficult for the focal factor to identify hard samples, as the learning is dominated by the majority class. Hence, the focus is weakened, and a suitable \(\gamma\) value is needed to scale down easy samples in the cost function. Nevertheless, the focal factor helps in reducing the AE, meaning that it speeds up the training.
Hence, the strong sensitivity of the \(\gamma\) hyperparameter in imbalanced datasets highlights the limitation of relying on a fixed value. By making \(\gamma\) trainable and updating it during each iteration, we completely remove the need for manual tuning and allow the model to adaptively balance easy and hard examples. As hypothesized, introducing this mechanism with regularization (\(\mathscr {L}_{\text {FBCE-T}}\)) yields clear performance gains, with improvements of up to 2.38% on imbalanced data. This advantage is further illustrated in Fig. 7(b) and (d), where the Kaggle dataset shows distinct adaptive patterns in both the learned \(\gamma\) values and the resulting accuracy, compared to the Essays dataset. Together, these results provide strong evidence that a trainable \(\gamma\) is not only effective but also essential for robust performance under class imbalance.
Figure 8 illustrates the normalized loss trends across epochs on the validation set for the Essays and Kaggle datasets, respectively. Once again, we observe that the optimal AE improves with focal-based loss compared with other CIMT approaches, particularly on the Kaggle dataset, resulting in faster convergence. Moreover, the loss curve for focal-based approaches exhibits a rapid initial decline followed by a slower descent, reflecting the focal loss mechanism, which prioritizes learning from hard samples while gradually refining feature representation. In summation, the focal factor alone helps to speed up the training process, inducing faster convergence, while a weight scaling factor enables it to handle highly imbalanced classes more effectively. Additionally, a trainable \(\gamma\) improves adaptability by dynamically adjusting to the data distribution, further enhancing its potential to contribute to multi-task learning.
Moving on, we explore mathematical expressions to explain why CIMT can enhance model performance. As shown in Table 6 for the impact analysis, we focus on \(\mathscr {L}_{\text {FBCE-W}}\), since the dynamic hyperparameter of \(\mathscr {L}_{\text {FBCE-T}}\) technically behaves like \(\mathscr {L}_{\text {FBCE-W}}\) during backpropagation in the MLP. Because \(\mathscr {L}_{\text {FBCE-T}}\) demonstrates better results, proving the effectiveness of \(\mathscr {L}_{\text {FBCE-W}}\) would also validate it in similar contexts. Here we adopt the confusion matrix [[1425, 70], [152, 88]], obtained from the most imbalanced dimension, ‘O’ (\(86.2\%\) positive), in the Kaggle dataset as predicted by \(\mathscr {L}_{\text {BCE}}\). To study the improvement of \(\mathscr {L}_{\text {WBCE-S}}\) and \(\mathscr {L}_{\text {FBCE-W}}\), we use the inverse sample distribution (the \(\alpha\) in Eq. 9) as the weight. True Positive, False Negative, True Negative, and False Positive results are treated as Easy Positive (EP), Hard Positive (HP), Easy Negative (EN), and Hard Negative (HN), respectively, since misclassifications represent difficult samples. We then assume the model output (\(\hat{y}\)) takes the values 0.75 and 0.25 for positive and negative predictions, respectively. Lastly, the focal \(\gamma\) is set to 1.5, as in the experiment. Note that these assumptions are for reference only, as they reflect the baseline model’s current prediction performance and the sample difficulty. These factors guide the model toward the global optimum, and slight adjustments or rescaling do not significantly affect the results. We convert the regular \(\mathscr {L}_{\text {BCE}}\), \(\mathscr {L}_{\text {WBCE-S}}\), and \(\mathscr {L}_{\text {FBCE-W}}\) (Eq. 7, Eq. 9, and Eq. 10) into cost functions, J. We then express each cost according to the positive and negative labels and further expand it in terms of hardness (EP, HP, EN, HN), as shown in Eq. 26, Eq. 27, and Eq. 28, where n stands for the total number of samples.
By substituting the aforementioned values into Eq. 26, Eq. 27, and Eq. 28, we obtain the respective output (the impact) for each term. However, since scaling diminishes the raw magnitudes, and to facilitate comparison across CIMT models, we normalize the impacts to a scale of 0 to 1. The scaled impact is tabulated in Table 6. The results make it evident that \(\mathscr {L}_{\text {FBCE-W}}\) assigns more weight to hard negative samples and significantly lower weight to easy positive samples during model training. This clarifies why \(\mathscr {L}_{\text {FBCE-W}}\) is effective in increasing the accuracy of negative samples but may slightly decrease the accuracy of positive samples. When we substitute a new confusion matrix, [[1198, 297], [60, 180]], predicted by the base model trained with \(\mathscr {L}_{\text {FBCE-W}}\) (rightmost column), we notice that the weight of the hard positive term increases, indicating that the focal factor reweights the terms. This highlights the dynamic nature of the focal factor in adapting its learning behavior to the circumstances.
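This impact analysis can be reproduced in a few lines. Since Eq. 26 to Eq. 28 are not restated here, the per-term expressions below are assumptions consistent with the stated setup (\(\hat{y} = 0.75/0.25\), \(\gamma = 1.5\), and the inverse sample distribution as \(\alpha\)), intended only to illustrate the calculation.

```python
import math

# Confusion matrix for dimension 'O' (Kaggle) as predicted by L_BCE:
# rows = actual {positive, negative}, columns = predicted {positive, negative}.
tp, fn, fp, tn = 1425, 70, 152, 88           # EP, HP, HN, EN counts
n_pos, n_neg = tp + fn, fp + tn
n = n_pos + n_neg
gamma = 1.5
p_t = {"EP": 0.75, "HP": 0.25, "EN": 0.75, "HN": 0.25}   # prob. of the true class
counts = {"EP": tp, "HP": fn, "EN": tn, "HN": fp}
alpha = {"EP": n / n_pos, "HP": n / n_pos,               # inverse sample distribution
         "EN": n / n_neg, "HN": n / n_neg}

def impact(weighted=False, focal=False):
    out = {}
    for k in counts:
        w = alpha[k] if weighted else 1.0                # class-distribution weight
        f = (1 - p_t[k]) ** gamma if focal else 1.0      # focal down-weighting
        out[k] = counts[k] * w * f * -math.log(p_t[k]) / n
    m = max(out.values())                                # normalize to a 0-1 scale
    return {k: round(v / m, 3) for k, v in out.items()}

print("BCE:   ", impact())                               # dominated by easy positives
print("WBCE-S:", impact(weighted=True))
print("FBCE-W:", impact(weighted=True, focal=True))      # hard negatives dominate
```

Under these assumptions, the FBCE-W row is dominated by the HN term while the EP term shrinks toward zero, mirroring the qualitative pattern reported in Table 6.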
Additionally, we analyze the computational complexity of our proposed approach in two respects: (i) the performance-efficiency trade-off, and (ii) the cost of the mitigation technique itself. Our model employs a TF-IDF representation with 5000 features (t) and a three-layer neural network with dimensions (5, 5, 1), incorporating the adaptive focal loss function. The neural network contributes a computational complexity of \(O(nt + nth_1 + n h_1 h_2 + n h_2 h_3)\), which simplifies to \(O(5000n + 25000n + 25n + 5n) = O(30030n)\) when substituting \(h_1 = h_2 = 5\) and \(h_3 = 1\). The adaptive focal loss adds only a minor additional complexity of \(O(n + 1)\) due to its weighting factor and a single trainable-parameter update. Thus, the total training complexity of our approach is effectively O(30031n).
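For reference, a minimal sketch of the backbone whose cost is counted above is given below; the layer sizes follow the text, while the activation functions and the single sigmoid head per personality dimension are assumptions.

```python
import torch.nn as nn

# 5000 TF-IDF features (t) feeding a (5, 5, 1) MLP, one head per dimension.
backbone = nn.Sequential(
    nn.Linear(5000, 5),  # the n*t*h1 term: 25,000 multiply-adds per sample
    nn.ReLU(),
    nn.Linear(5, 5),     # the n*h1*h2 term: 25 per sample
    nn.ReLU(),
    nn.Linear(5, 1),     # the n*h2*h3 term: 5 per sample
    nn.Sigmoid(),        # binary output for one personality dimension
)
```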
In comparison to the BERT-based model proposed by3, a popular state-of-the-art approach for natural language processing tasks that generally involves substantial computational overhead due to its transformer architecture, our model achieves comparable or even superior performance at significantly lower computational cost owing to its simpler design. Its complexity, represented as \(O(n L h_{\text {BERT}}^2 + n L^2 h_{\text {BERT}})\), arises mainly from the self-attention mechanism (\(O(n L^2 h_{\text {BERT}})\)), the position-wise feedforward network (\(O(n L h_{\text {BERT}}^2)\)), and a negligible classification-head complexity of O(1). Given that \(30031 \ll L h_{\text {BERT}}^2\), our proposed method is clearly far more computationally efficient. Moreover, in contrast to SMOTE4, which incurs a quadratic preprocessing cost of \(O(n_m^2t + r n_mt)\), where \(n_m\) denotes the number of minority-class samples and r is the oversampling ratio, the additional complexity introduced by our adaptive focal loss is only O(1) per sample. Even over 45 epochs (the maximum number of epochs in our experiments), this overhead remains negligible, and the overall asymptotic training complexity is linear in the number of samples. SMOTE, by contrast, is dominated by its expensive k-NN search step, making it significantly more costly, particularly in high-dimensional feature spaces.
While our method demonstrates strong performance in addressing class imbalance and enhancing the model’s focus on hard examples, it may also be sensitive to mislabeled or noisy samples and prone to instability under extreme class imbalance. This sensitivity arises because the model amplifies the influence of such samples during backpropagation. Future work could explore strategies to mitigate these limitations, such as adaptive loss weighting or noise-robust training techniques; for example, approaches like Multi-Objective Manifold Representation for Opinion Mining and Grey Wolf Optimization60,61 could be investigated. In addition, fairness-aware evaluation should be incorporated to examine potential biases across demographic subgroups, and the approach could be extended to multimodal personality prediction settings where textual, visual, and behavioral cues complement one another.
Conclusion
In this paper, we conducted an empirical study on the proposed personality-stratified dataset splitting and label representation, which are significant for personality recognition, improving stability and learning efficiency by up to 9%. We then introduced a novel Adaptive Focal Loss for multi-task personality recognition models and conducted a comprehensive analysis of various CIMTs to address hard class-imbalance bias. On the Kaggle dataset, our method achieved a substantial 7% improvement in BA compared to conventional approaches, even surpassing transformer-based models by 4.5%. Similarly, on the Essays dataset, we observed a 5% improvement in BA over regular methods. Notably, our approach exhibits competitive performance with advanced transformer models, with a performance gap of only about 1%. It is important to highlight that we use BA as the primary metric for comparison, as it better captures performance disparities under imbalanced class distributions than regular accuracy. Moreover, we strongly recommend BA over the F1 score as a more reliable metric for evaluating models on datasets with imbalanced class distributions in personality recognition, and we recommend that future CIMT research adopt BA as a standard evaluation metric, as it offers greater insight into model performance in such scenarios. Finally, we demonstrate that optimizing the training process can have a greater impact on performance than simply increasing model size and computational power, which is particularly crucial for applications such as deploying the model on robots with limited computational resources.
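As a concrete illustration of why we recommend BA, the short snippet below (reusing the ‘O’-dimension confusion matrix quoted earlier as an assumed example) shows how RA can look strong while BA exposes the bias toward the majority class.

```python
tp, fn, fp, tn = 1425, 70, 152, 88                 # [[TP, FN], [FP, TN]]
ra = (tp + tn) / (tp + fn + fp + tn)               # regular accuracy, ~0.872
ba = 0.5 * (tp / (tp + fn) + tn / (tn + fp))       # balanced accuracy, ~0.660
print(f"RA = {ra:.3f}, BA = {ba:.3f}")
```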
Data availability
The related research resources, including the datasets and code repository, are available and can be accessed at: https://research.jingjietan.com/?q=AFLPS
References
Suwa, S. et al. Home-care professionals’ ethical perceptions of the development and use of home-care robots for older adults in Japan. Int. J. Human-Computer Interact. 36, 1295–1303. https://doi.org/10.1080/10447318.2020.1736809 (2020).
Salminen, J., Rao, R. G., Jung, S.-g., Chowdhury, S. A. & Jansen, B. J. Enriching social media personas with personality traits: A deep learning approach using the big five classes. In Artificial Intelligence in HCI: First International Conference, AI-HCI 2020, Held as Part of the 22nd HCI International Conference, HCII 2020, Copenhagen, Denmark, July 19–24, 2020, Proceedings 22, 101–120 (Springer, 2020).
Mehta, Y. et al. Bottom-up and top-down: Predicting personality with psycholinguistic and language model features. In 2020 IEEE International Conference on Data Mining (ICDM), 1184–1189, https://doi.org/10.1109/ICDM50108.2020.00146 (2020).
Cerkez, N. & Vareskic, V. Machine learning approaches to personality classification on imbalanced MBTI datasets. 2021 44th International Convention on Information, Communication and Electronic Technology (MIPRO) 1259–1264, https://doi.org/10.23919/MIPRO52101.2021.9596742 (2021).
Kampman, O., Barezi, E. J., Bertero, D. & Fung, P. Investigating audio, video, and text fusion methods for end-to-end automatic personality prediction. Proc. 56th Annu. Meet. Assoc. for Comput. Linguist. (Volume 2: Short Pap.) 606–611, https://doi.org/10.18653/v1/P18-2096 (2018).
Furnham, A. Myers-Briggs type indicator (MBTI). Encycl. Pers. Individ. Differ. 1–4, https://doi.org/10.1007/978-3-319-28099-8_50-1 (2017).
Jang, K. L., Livesley, W. J. & Vernon, P. A. Heritability of the big five personality dimensions and their facets: A twin study. J. Pers. 64, 577–592. https://doi.org/10.1111/j.1467-6494.1996.tb00522.x (1996).
Costa, P. T., Terracciano, A. & McCrae, R. R. Gender differences in personality traits across cultures: Robust and surprising findings. J. Pers. Soc. Psychol. 81, 322–331. https://doi.org/10.1037/0022-3514.81.2.322 (2001).
Schmitt, D. P., Allik, J., McCrae, R. R. & Benet-Martinez, V. The geographic distribution of big five personality traits. J. Cross-Cultural Psychol. 38, 173–212. https://doi.org/10.1177/0022022106297299 (2007).
Obschonka, M., Schmitt-Rodermund, E., Silbereisen, R. K., Gosling, S. D. & Potter, J. The regional distribution and correlates of an entrepreneurship-prone personality profile in the United States, Germany, and the United Kingdom: A socioecological perspective. SSRN Electron. J. https://doi.org/10.2139/ssrn.2258392 (2013).
Guo, A. et al. Personality prediction from task-oriented and open-domain human-machine dialogues. Sci. Reports 14, 3868. https://doi.org/10.1038/s41598-024-53989-y (2024).
Heylighen, F. Self-organization in communicating groups: The emergence of coordination, shared references and collective intelligence. In Understanding Complex Systems, Understanding complex systems, 117–149 (Springer Berlin Heidelberg, Berlin, Heidelberg, 2013).
Yarkoni, T. Personality in 100,000 words: A large-scale analysis of personality and word use among bloggers. J. Res. Pers. 44, 363–373. https://doi.org/10.1016/j.jrp.2010.04.001 (2010).
Costa, E. et al. How the world changed social media (UCL Press, London, England, 2016).
Daisley, R. J. Considering personality type in adult learning: Using the Myers-Briggs type indicator in instructor preparation at PricewaterhouseCoopers. Perform. Improv. 50, 15–24. https://doi.org/10.1002/pfi.20196 (2011).
Cao, X. & Kosinski, M. Large language models know how the personality of public figures is perceived by the general public. Sci. Reports 14, 6735. https://doi.org/10.1038/s41598-024-57271-z (2024).
Simari, G. I., Martinez, M. V., Gallo, F. R. & Falappa, M. A. The big-2/rose model of online personality. Cogn. Comput. 13, 1198–1214. https://doi.org/10.1007/s12559-021-09866-1 (2021).
Cervone, D. & Beck, E. D. Theoretical and methodological issues in personality research. The Wiley Encycl. Pers. Individ. Differ. 1–11, https://doi.org/10.1002/9781118970843.ch71 (2020).
Weaver, J. & Kiewitz, C. Eysenck personality questionnaire. Handb. Res. on Electron. Surv. Meas. 360–363, https://doi.org/10.4018/978-1-59140-792-8.ch052 (2007).
Maragakis, A. Eysenck personality questionnaire-revised. The Wiley Encycl. Pers. Individ. Differ. 283–286, https://doi.org/10.1002/9781119547167.ch119 (2020).
Myers, I. B. The Myers-Briggs type indicator: Manual. https://doi.org/10.1037/14404-000 (1962).
Goldberg, L. R. The structure of phenotypic personality traits. Am. Psychol. 48, 26–34. https://doi.org/10.1037/0003-066X.48.1.26 (1993).
Ashton, M. C. et al. A six-factor structure of personality-descriptive adjectives: Solutions from psycholexical studies in seven languages. J. Pers. Soc. Psychol. 86, 356–366. https://doi.org/10.1037/0022-3514.86.2.356 (2004).
Furnham, A. The big five versus the big four: The relationship between the Myers-Briggs type indicator (MBTI) and NEO-PI five factor model of personality. Pers. Individ. Differ. 21, 303–307. https://doi.org/10.1016/0191-8869(96)00033-5 (1996).
Celli, F. & Lepri, B. Is big five better than MBTI? Proceedings of the Fifth Italian Conference on Computational Linguistics CLiC-it, 93–98. https://doi.org/10.4000/books.aaccademia.3147 (2018).
Tan, J. J., Kwan, B.-H., Ng, D. W.-K., Hum, Y. C., Mokraoui, A. & Lo, S.-Y. Prompting-in-a-series: Psychology-informed contents and embeddings for personality recognition with decoder-only models. IEEE Trans. Comput. Soc. Syst. 1–15. https://doi.org/10.1109/tcss.2025.3593323 (2025).
Michel, J.-B. et al. Quantitative analysis of culture using millions of digitized books. Science 331, 176–182. https://doi.org/10.1126/science.1199644 (2011).
Mirdjanovna, K. S. Finite state machine model for Uzbek language morphological analyzer. 2021 6th Int. Conf. on Comput. Sci. Eng. (UBMK) 395–400, https://doi.org/10.1109/UBMK52708.2021.9559023 (2021).
Jabbar, A., Iqbal, S., Tamimy, M. I., Hussain, S. & Akhunzada, A. Empirical evaluation and study of text stemming algorithms. Artif. Intell. Rev. 53, 5559–5588. https://doi.org/10.1007/s10462-020-09828-3 (2020).
Müller, T., Cotterell, R., Fraser, A. & Schütze, H. Joint lemmatization and morphological tagging with lemming. Proc. 2015 Conf. on Empir. Methods Nat. Lang. Process. 2268–2274, https://doi.org/10.18653/v1/D15-1272 (2015).
Sharipov, M. & Sobirov, O. Development of a rule-based lemmatization algorithm through finite state machine for Uzbek language. arXiv preprint arXiv:2210.16006 (2022).
Li, L. & Qiu, X. Token-aware virtual adversarial training in natural language understanding. Proc. AAAI Conf. on Artif. Intell. 35, 8410–8418. https://doi.org/10.1609/aaai.v35i9.17022 (2021).
Bhatta, J., Shrestha, D., Nepal, S., Pandey, S. & Koirala, S. Efficient estimation of Nepali word representations in vector space. J. Innov. Eng. Educ. 3, 71–77. https://doi.org/10.3126/jiee.v3i1.34327 (2020).
Beckwith, R., Fellbaum, C., Gross, D. & Miller, G. A. Wordnet: A lexical database organized on psycholinguistic principles. Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon 211–232, https://doi.org/10.4324/9781315785387-12 (2021).
Guzmán Cabrera, R. & Hernández Farias, D. I. Exploring the use of lexical and psycho-linguistic resources for sentiment analysis. In Advances in Computational Intelligence: 19th Mexican International Conference on Artificial Intelligence, MICAI 2020, Mexico City, Mexico, October 12–17, 2020, Proceedings, Part II 19, 109–121 (Springer, 2020).
Tausczik, Y. R. & Pennebaker, J. W. The psychological meaning of words: LIWC and computerized text analysis methods. J. Lang. Soc. Psychol. 29, 24–54. https://doi.org/10.1177/0261927X09351676 (2010).
Tan, J. J., Kwan, B.-H., Ng, D. W.-K. & Hum, Y. C. Psychology-informed natural language understanding: Integrating personality and emotion-aware features for comprehensive sentiment analysis and depression detection. Pertanika J. Sci. Technol. 33. https://doi.org/10.47836/pjst.33.s4.04 (2025).
Mairesse, F., Walker, M. A., Mehl, M. R. & Moore, R. K. Using linguistic cues for the automatic recognition of personality in conversation and text. J. Artif. Intell. Res. 30, 457–500. https://doi.org/10.1613/jair.2349 (2007).
Majumder, N., Poria, S., Gelbukh, A. & Cambria, E. Deep learning-based document modeling for personality detection from text. IEEE Intell. Syst. 32, 74–79. https://doi.org/10.1109/MIS.2017.23 (2017).
Rahman, M. A., Faisal, A. A., Khanam, T., Amjad, M. & Siddik, M. S. Personality detection from text using convolutional neural network. 2019 1st Int. Conf. on Adv. Sci. Eng. Robotics Technol. (ICASERT) 1–6, https://doi.org/10.1109/ICASERT.2019.8934548 (2019).
Tinwala, W. & Rauniyar, S. Big five personality detection using deep convolutional neural networks. https://doi.org/10.20944/preprints202109.0199.v1 (2021).
Ontoum, S. & Chan, J. H. Personality type based on Myers-Briggs type indicator with text posting style by using traditional and deep learning. arXiv preprint arXiv:2201.08717 (2022).
Amirhosseini, M. H. & Kazemian, H. Machine learning approach to personality type prediction based on the Myers-Briggs type indicator®. Multimodal Technologies and Interaction 4, 9. https://doi.org/10.3390/mti4010009 (2020).
Hernandez, R. & Scott, I. Predicting Myers-Briggs type indicator with text. In 31st Conference on Neural Information Processing Systems (NIPS 2017) (2017).
Ramezani, M., Feizi-Derakhshi, M.-R. & Balafar, M.-A. Text-based automatic personality prediction using kgrat-net: a knowledge graph attention network classifier. Sci. Reports 12, https://doi.org/10.1038/s41598-022-25955-z (2022).
Lin, T.-Y., Goyal, P., Girshick, R., He, K. & Dollar, P. Focal loss for dense object detection. 2017 IEEE Int. Conf. on Comput. Vis. (ICCV) 2999–3007, https://doi.org/10.1109/ICCV.2017.324 (2017).
Deimann, R., Preidt, T., Roy, S. & Stanicki, J. Is personality prediction possible based on Reddit comments? https://doi.org/10.48550/arXiv.2408.16089 (2024).
Pennebaker, J. W. & King, L. A. Linguistic styles: Language use as an individual difference. J. Pers. Soc. Psychol. 77, 1296–1312. https://doi.org/10.1037/0022-3514.77.6.1296 (1999).
J., M. (MBTI) Myers-Briggs personality type dataset (2017).
Yuan, Y., Wu, L. & Zhang, X. Gini-impurity index analysis. IEEE Transactions on Inf. Forensics Secur. 16, 3154–3169. https://doi.org/10.1109/tifs.2021.3076932 (2021).
Tan, J. J. essays-big5 (revision ac1977f). https://doi.org/10.57967/hf/3956 (2025).
Tan, J. J. kaggle-mbti (revision 46036ad). https://doi.org/10.57967/hf/3955 (2025).
Miller, G. A. WordNet: A lexical database for English. In Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994 (1994).
Shawky, P. E., ElKaffas, S. M. & Guirguis, S. K. Effect of typos on text classification accuracy in word and character tokenization. J. Adv. Res. Appl. Sci. Eng. Technol. 40, 152–162. https://doi.org/10.37934/araset.40.2.152162 (2024).
Vennapusa, B. R. bharathvennapusa GitHub repository. https://github.com/bharathvennapusa [Accessed 12-02-2025].
Mohammed, R., Rawashdeh, J. & Abdullah, M. Machine learning with oversampling and undersampling techniques: Overview study and experimental results. In 2020 11th International Conference on Information and Communication Systems (ICICS), 243–248, https://doi.org/10.1109/ICICS49469.2020.239556 (2020).
Chawla, N. V., Bowyer, K. W., Hall, L. O. & Kegelmeyer, W. P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357. https://doi.org/10.1613/jair.953 (2002).
Kendall, A., Gal, Y. & Cipolla, R. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. arXiv preprint arXiv:1705.07115 (2018).
Elor, Y. & Averbuch-Elor, H. To SMOTE, or not to SMOTE? arXiv preprint arXiv:2201.08528 (2022).
Rahman, P., Daneshfar, F. & Parvin, H. Multi-objective manifold representation for opinion mining. Expert. Syst. 42(8), https://doi.org/10.1111/exsy.70092 (2025).
Hassan, E., Saber, A., El-Sappagh, S. & El-Rashidy, N. Optimized ensemble deep learning approach for accurate breast cancer diagnosis using transfer learning and grey wolf optimization. Evol. Syst. 16(2), 59. https://doi.org/10.1007/s12530-025-09686-w (2025).
Acknowledgements
This work was supported by the Universiti Tunku Abdul Rahman Research Fund (IPSR/RMC/UTARRF/2021-C1/K03). The first author would like to thank Yi Jie Wong for the invaluable feedback on this research. Thanks also to Jing Jie’s peers, whose many suggestions greatly improved this paper.
Author information
Authors and Affiliations
Contributions
J. T. conceived and conducted the experiments, analyzed the results and wrote the manuscript. B. K., D. N., and Y. H. provided critical feedback and revised the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Tan, J.J., Kwan, BH., Ng, D.WK. et al. Adaptive focal loss with personality stratification for stably mitigating hard class imbalance in multi-dimensional personality recognition. Sci Rep 15, 39241 (2025). https://doi.org/10.1038/s41598-025-22853-y