Introduction

Hope is a powerful emotion or state of mind that significantly impacts human psychology and our ability to cope with life’s challenges1. Often associated with optimism, hope is characterized by an expectation of desired outcomes and the eventual fulfillment of desires or goals2. Researchers categorize hope broadly into two main types: “generalized” and “particularized” hope. These categories differ based on their targeted goals, the underlying cognitive and emotional processes, and the resulting behaviors3,4,5.

Generalized hope refers to a sense of positive future developments that is not necessarily linked to a specific goal/outcome, object, or quality; rather, it fosters a broad optimism about the future6. It plays a crucial role in building resilience: generalized hope helps individuals maintain a positive outlook in the face of adversity, encouraging them to persevere through challenges even when the specific goal is unclear7. For example, the sentence “Despite the challenges we face, I remain hopeful that brighter days are ahead of us” expresses generalized hope because it does not specify a particular goal or outcome. Instead, it conveys a broad, positive outlook toward the future regardless of current difficulties; it is not tied to a specific event or result but rather to a belief in overall improvement, reflecting the resilience and sustained hopefulness that characterize generalized hope.

While research on generalized hope is limited, Maretha8 and O’Hara et al.6 define particularized hope as hope that focuses on achieving or acquiring a specific goal or outcome. They describe this type of hope as a combination of personal emotions and mental activities related to desire, arguing that particularized hope involves a feeling of encouragement fueled by desire and the formation of expectations, which can be either realistic or unrealistic.

Realistic hope focuses on achieving a probable outcome. This type of hope involves mental imagery and an assessment of the likelihood of occurrence to maintain a connection with reality7. For example, a student who studied hard all week might think, “I believe I can pass the exam,” an expression of realistic hope.

On the other hand, unrealistic hope is based on incomplete or inaccurate information and the expectation of improbable events9. An example could be a student with poor grades who thinks, “Everyone thinks I failed, but a miracle happens every day”. Different types of hope can lead to varied behaviors1. Particularized hope, centered on achieving a specific goal, often motivates individuals to engage in targeted actions to reach that goal. This type of hope can drive a person to plan and execute steps that directly contribute to the desired outcome, like a student studying diligently for an exam they aim to pass. The specificity and clarity of the goal in particularized hope can provide a clear direction for action.

Distinguishing between different hope types, such as goal-oriented “Particularized” hope and broader “Generalized” hope, provides a multifaceted perspective on human behavior. This understanding allows us to predict behaviors (specific hope motivates action, while generalized hope fosters resilience)10, guide therapy (realistic hope can support interventions), promote well-being (understanding how hope types contribute to mental health is crucial)10, and inform personal growth (recognizing realistic hope helps in setting achievable goals and avoiding disappointment)1.

Over the past few years, several natural language processing (NLP) tasks, such as sentiment analysis, emotion detection, hate speech identification, abusive language detection, misogyny detection11, and more, have gained prominence for studying social media content and online interactions12. However, despite the significance of hope in human psychology and decision-making, only a limited number of studies have explored the types of hope and expectations shared on social media13,14. Hope speech detection as an NLP task was introduced by Chakravarthi14 as a binary classification task that categorizes YouTube comments, written in English and code-mixed Dravidian languages, regarding conflicts between India and Pakistan into two predefined categories: “Hope,” representing positivity and support, and “Not Hope,” encompassing neutral or offensive comments. This definition provided a fundamental understanding of hope and was further refined by Balouchzahi et al.13, who developed a multiclass hope speech detection dataset that classifies English tweets into four categories: “Generalized Hope”, “Realistic Hope”, “Unrealistic Hope”, and “Not Hope”. The definitions and some samples for each label are provided in Sect. “Annotation guidelines”.

Inspired by the PolyHope framework13, and to promote multiclass hope speech detection across languages, we extend the PolyHope dataset to Spanish and German. Several compelling reasons underpin this expansion, which marks an important step toward multiclass hope speech detection across linguistic boundaries.

Firstly, by incorporating Spanish, acknowledged as the second most widely spoken language by native speakers globally15, and German, a predominant language in Europe16, we enrich the dataset’s linguistic diversity. This addition allows the dataset to capture a wider range of cultural nuances in hope expression, thereby increasing its representativeness.

Moreover, this extension addresses a notable gap in research resources for hope speech detection in Spanish and German. With datasets tailored to these languages, researchers can now examine the intricacies of hope expression within these linguistic contexts, fostering deeper insights and more accurate analyses.

To the best of our knowledge, this is the first study to create multiclass hope speech datasets for both Spanish and German, building upon and significantly extending the original PolyHope framework developed for English. Unlike previous studies that focus primarily on binary classification or English-only datasets, our work introduces multilingual, fine-grained annotations that account for cultural and linguistic variation in hope expression.

The main contributions of this research paper are listed below:

  • Development of the first multiclass hope speech detection datasets for the Spanish and German languages,

  • A systematic review of existing datasets and techniques on hope speech detection,

  • Comparative analysis of different learning approaches for hope speech detection across different languages,

  • Comparative analysis of multilingual transformers for multilingual hope speech detection.

Related work

Hope speech detection is a recently established field of research; in this section, we review proposals that address binary and multiclass classification datasets and techniques. Additionally, we examine monolingual, cross-lingual, and multilingual paradigms. The main purpose of this review is to summarize what has been done so far and to point out where more work is needed in hope speech detection.

Hope speech detection datasets

Table 1 provides a summary of all available hope speech datasets along with their key characteristics. A detailed description of each dataset is presented in the following sections.

The HopeEDI dataset, as introduced by Chakravarthi14, was the first multilingual resource for detecting hope speech within the context of Equality, Diversity, and Inclusion. It comprises user-generated comments from YouTube, with 28,451 comments in English, 20,198 in Tamil, and 10,705 in Malayalam, each manually annotated to indicate the presence or absence of hope speech as a binary classification task. This dataset uniquely emphasizes content that is encouraging, positive, and supportive, making it the first to address hope speech in a multilingual framework. Subsequent work by Chakravarthi14, García-Baena et al.17, and Hande et al.18 extended this line of research with Spanish and code-mixed Kannada-English binary hope speech detection datasets.

García-Baena et al.17 created a balanced dataset of 1,650 Spanish LGBT-related tweets, labelled for Hope Speech and Non-Hope Speech. In baseline experiments, they used traditional machine learning (ML) algorithms (Support Vector Machine (SVM), Multinomial Naive Bayes (MNB), Logistic Regression (LR)) and deep learning (DL) models (Multilayer Perceptron (MLP), Convolutional Neural Network (CNN), Bidirectional Long Short-Term Memory (BiLSTM)) with various features (TF-IDF, word embeddings, BERT embeddings). They reported that the fine-tuned BETO model achieved the best performance with an F1-score of 85.12%.

Hande et al.18 introduced KanHope, a dataset for detecting hope speech in code-mixed Kannada-English texts collected from YouTube. The work addresses the scarcity of studies on spreading positivity through content such as hope speech in under-resourced languages. To this end, the researchers proposed DC-BERT4HOPE, a dual-channel BERT-based model that uses both the original code-mixed data and its English translation, achieving the highest performance with a weighted F1-score of 0.752.

Nath et al.19 presented the first dataset developed specifically to identify hope speech in the Bengali language. DL models such as CNNs and BiLSTMs and transformer models such as Bangla BERT and Multilingual BERT were evaluated on this dataset. The best performance came from MuRIL, a BERT-based transformer model designed specifically for multiple Indian languages, reaching an average macro F1-score of 0.8498.

Malik et al.20 addressed multilingual hope speech detection in English and Russian using transfer learning with a fine-tuning approach. Their work compares joint multilingual and translation-based methods, with the latter translating all content into one language for classification. The study presents a Russian corpus of YouTube comments, and experiments on bilingual datasets show that the proposed framework outperforms the baselines, with the translation-based approach achieving the best performance at 94% accuracy and an 80.24% F1-score.

Balouchzahi et al.13 presented an innovative method to recognize and classify hope speech in English tweets by creating a corpus called ’PolyHope’, which has two levels of categorization. The first level distinguishes between ’Hope’ and ’Not Hope’, while at the second level, hope speech is further classified into Generalized Hope, Realistic Hope, and Unrealistic Hope. Various models were evaluated, including traditional ML models, DL models, and transformer models such as BERT; according to their experimental results, transformers outperformed the other models, reaching a macro F1-score of 0.85 in binary classification and 0.72 in multiclass classification.

Later, Balouchzahi et al.21 set out to detect hope and hopelessness in Urdu tweets by creating the first annotated Urdu dataset with five classes, namely Generalized Hope, Unrealistic Hope, Realistic Hope, Not Hope, and Hopelessness. Several models, including LR, BiLSTM, CNN, and transformers (BERT and RoBERTa), were benchmarked in that work. While simpler algorithms such as LR worked best for binary classification, reaching an average macro F1-score of 0.7593, transformers outperformed them on the more fine-grained multiclass labeling, with RoBERTa reaching an average macro F1-score of 0.4801.

Table 1 Summary of Hope speech detection datasets in the literature.

Hope speech detection techniques

The aforementioned datasets13,14,17,18 were used to organize several shared tasks, in which the datasets were made available to researchers and several findings were reported. In this section, we therefore summarize the models submitted to two of the most recent shared tasks, named HOPE, organized at IberLEF 2023 and 202422,23.

In the HOPE shared task at IberLEF 202322, two binary classification datasets were provided with the goal of promoting research on identifying uplifting and encouraging speech in online communications. A total of 12 teams participated in the Spanish subtask, while 11 teams participated in the English subtask. The Spanish dataset comprised 1,312 training samples, 300 development samples, and 450 test samples, featuring a balanced distribution of hope speech (691 training samples) and non-hope speech (621 training samples), which contributed to better performance. In contrast, the English dataset was significantly larger, with 22,651 training samples, but it was heavily imbalanced, containing only 21 hope speech samples compared to 20,690 non-hope speech samples. This imbalance posed challenges for model training and resulted in lower scores in the English subtask, as models struggled to learn to identify hope speech effectively.

In the HOPE 2023 shared task22, various teams employed different methodologies to tackle the detection of hope speech. The I2C-Huelva team24 utilized BERTuit, a transformer model specifically designed for Spanish, achieving an average macro F1-score of 0.501 in the English subtask and 0.481 in the Spanish subtask. The Habesha team25 implemented a Support Vector Machine (SVM) model using term frequency and inverse document frequency (TF-IDF) for feature selection, reporting an average macro F1-score of 0.489 for English and 0.481 for Spanish. The top-performing team in the Spanish subtask, Zootopi26, achieved an impressive average macro F1-score of 0.916, while the best result in the English subtask was 0.501 by I2C-Huelva24, highlighting the challenges faced in the English dataset due to its imbalance. Overall, transformer-based models were predominant among the top teams, although some teams also explored classic ML techniques. Table 2 presents a summary of the teams that participated in the HOPE shared task, highlighting their learning approaches, methods, and performance metrics (F1 scores) for both Spanish and English subtasks.

Table 2 Teams and Their Approaches in the HOPE 2023 Shared Task.

The most recent iteration of the HOPE 2024 shared task23 focused on detecting hope speech from two perspectives. Task 1 emphasized equality, diversity, and inclusion, aiming to identify hope speech within texts related to the LGTBI community. Participants were provided with a training corpus and evaluated on both known and unknown domains, classifying texts as either Hope Speech (HS) or Not Hope (NH). Task 2 introduced two key differences: (i) it examined hope as a form of expectation, and (ii) it adopted a multiclass classification approach. In this task, texts were first categorized as hope or not hope, with further distinctions made between General Hope (GH), Realistic Hope (RH), and Unrealistic Hope (URH). Titled PolyHope, Task 2 explored hope as expectations across multiple domains and languages (English and Spanish), leveraging a larger dataset sourced from social media.

In this shared task, various teams employed diverse methodologies to tackle the detection of hope speech across both tasks. As in the previous edition (the HOPE 2023 shared task22), teams utilized a range of ML approaches, including traditional classifiers such as LR and SVM, as well as advanced DL techniques such as transformers and LLM-based models.

A significant number of teams, such as NTT@UIT32, ChauPhamQuocHung33, ABCD Team34, Ometeotl35, and UMUTeam36, leveraged pre-trained transformer models such as BERT, BETO, and mBERT. These models, often fine-tuned on specific datasets and combined with different prompting techniques, enabled these teams to achieve top rankings in both Task 1 and Task 2. The NTT@UIT team32 led in Task 1 with their use of ChatGPT-3.5 and advanced prompting techniques, while the ABCD Team34 excelled in Task 2 by experimenting with ensemble learning and additional preprocessing.

On the other hand, teams like CUFE37, CICPAK38, and VEL39 explored more traditional ML approaches, including methods like Gzip compression and SVM, often combined with text preprocessing. Although these methods were less commonly used, CUFE’s Gzip compression approach notably performed well, almost matching the performance of more advanced transformer-based methods in task 1.

A few teams, such as MUCS40 and Grima41, experimented with a mix of traditional ML models and transfer learning techniques. While their results were moderate, they demonstrated the potential of combining diverse algorithms, including ensemble methods, to improve classification performance. Lastly, some teams, like AmnaNaseeb42 and Lemlem43, relied mainly on traditional models and deep neural networks, participating mostly in Task 2 with varying degrees of success, often focusing on binary and multiclass classification in English. These approaches, although less sophisticated, highlighted the continuing relevance of traditional methods for specific tasks. A summary of the techniques used and their results in the HOPE 2024 shared task is presented in Table 3.

Table 3 Teams and their approaches in HOPE 2024 shared task.

Dataset development

Data collection and processing

The English keywords from the PolyHope dataset13 were translated into Spanish and German. Native speakers meticulously validated these translations to ensure comprehensive coverage of terms and their variations used in similar contexts. Tweepy API [https://docs.tweepy.org/en/stable/api.html] was used for data collection, and initially, approximately 33,330 raw tweets for German and 82,725 for Spanish were collected over an open time frame using these keywords. Then, the 50,000 most recent tweets for each language as of March 2022 were gathered and added to the keyword-based tweets. Basic preprocessing steps were applied, such as de-identifying mentions and URLs, removing duplicate tweets, and filtering out retweets and tweets with fewer than five words or containing only emojis, mentions, or URLs. The keywords used for scraping in each language and statistics of the raw data collected are presented in Table 4.
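For illustration, the filtering described above can be sketched as follows; this is a minimal, hypothetical snippet (function and variable names are ours, not the original collection script) and assumes the raw tweets have already been downloaded, e.g., via the Tweepy API.

```python
import re

def clean_and_filter(raw_tweets):
    """De-identify mentions/URLs and drop retweets, duplicates, and very short tweets."""
    seen, kept = set(), []
    for text in raw_tweets:
        if text.startswith("RT @"):                              # filter out retweets
            continue
        text = re.sub(r"@\w+", "@user", text)                    # de-identify mentions
        text = re.sub(r"https?://\S+", "http://url", text)       # de-identify URLs
        content_words = [w for w in text.split() if w not in {"@user", "http://url"}]
        if len(content_words) < 5:                               # fewer than five real words
            continue
        if text in seen:                                         # remove duplicate tweets
            continue
        seen.add(text)
        kept.append(text)
    return kept
```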

Table 4 Keywords used for data collection and statistics of unlabelled data.

Annotation guidelines

Detailed instructions were given to the annotators to assist them throughout the annotation process. The process comprised two primary steps: initially identifying tweets that conveyed hope and subsequently categorizing the type of hope into predefined classifications. The following brief outline of the task and the guidelines for two-level hope speech detection is adopted from13. These guidelines were accompanied by labeled samples provided by the authors, some of which are listed in Table 5.

  • Binary Hope Speech: in this level, each given tweet is classified as Hope or Not Hope.

    • Not Hope: The tweet does not express future-oriented hope, wish, or desire.

    • Hope: There is an expression of a future-oriented desire, wish, or want for something to happen or to be true.

  • Multiclass Hope Speech: in this level, tweets identified as expressing hope are further classified into the following fine-grained categories:

    • Generalized Hope: This type of hope reflects a general sense of optimism and hopefulness without being directed towards any specific event or outcome6,7.

    • Realistic Hope: This type of hope centers around the expectation of something reasonable, meaningful, and potentially achievable. It involves hoping for outcomes that are within the realm of possibility and likelihood, encompassing both regular and expected occurrences44,45.

    • Unrealistic Hope: This form of hope is characterized by a very low likelihood of occurrence, or even the absence of such a possibility. It often involves wishing for something to come true despite its remote or nonexistent chances of happening. Occasionally, unrealistic hopes may arise from emotions like anger, sadness, or depression, lacking any rational basis46.

Table 5 Annotated samples of MIND-HOPE dataset (GH: Generalized Hope, RH: Realistic Hope, URH: Unrealistic Hope).

Inter-annotator agreement (IAA)

The annotation process involved three annotators per language, each with at least a postgraduate background in computer science and experience in NLP tasks and annotations. The final label for each tweet was determined by majority voting among the annotators.

To assess agreement between the annotators, Cohen’s Kappa Coefficient47 was used for binary classification, while Fleiss’ Kappa48 was applied for multiclass classification. The Inter-Annotator Agreement (IAA) scores for the binary classification were 0.7814 for the Spanish dataset and 0.8524 for the German dataset. For the multiclass classification, the IAA scores were 0.7095 for Spanish and 0.8103 for German.
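For reference, both agreement coefficients can be computed with standard libraries. The snippet below is a sketch with toy labels rather than the actual annotation files; with three annotators, Cohen’s Kappa would typically be averaged over annotator pairs.

```python
from itertools import combinations

from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Binary task: average pairwise Cohen's kappa over the three annotators (toy labels)
binary_annotations = [
    ["Hope", "Not Hope", "Hope", "Hope", "Not Hope"],      # annotator 1
    ["Hope", "Not Hope", "Not Hope", "Hope", "Not Hope"],  # annotator 2
    ["Hope", "Hope", "Hope", "Hope", "Not Hope"],          # annotator 3
]
pairwise = [cohen_kappa_score(a, b) for a, b in combinations(binary_annotations, 2)]
print(sum(pairwise) / len(pairwise))

# Multiclass task: Fleiss' kappa; rows = tweets, columns = annotators, values = class ids
multiclass_annotations = [
    [0, 0, 1],
    [2, 2, 2],
    [3, 3, 1],
    [0, 0, 0],
]
counts, _ = aggregate_raters(multiclass_annotations)  # per-tweet category counts
print(fleiss_kappa(counts))
```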

Table 6 Statistics of annotated datasets.

Statistics of the datasets

The final statistics of the datasets for the Spanish and German languages are presented in Table 6. They show that, similar to the existing English hope speech dataset13, the label distribution is highly imbalanced. The distribution of labels for each language is presented in Figure 1.

Fig. 1 Label distribution over datasets (NH: Not Hope, GH: Generalized Hope, RH: Realistic Hope, URH: Unrealistic Hope).

Benchmarks

In this section, we describe the experimental setup for benchmarking on the proposed datasets, including the use of cross-validation. Specifically, we employed a 5-fold cross-validation approach in all our experiments. This method involves splitting the dataset into five equal parts, where each part is used as a validation set while the remaining four parts serve as the training set. The model is trained and evaluated five times, each time with a different fold as the validation set. The results from these iterations are then averaged to produce the final scores.
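A minimal sketch of this protocol is shown below; stratified splitting is used to preserve label proportions in each fold, and the training function is a placeholder for any of the models benchmarked in this section.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

def cross_validate(texts, labels, build_and_train, n_splits=5, seed=42):
    """Return the average macro F1 over n_splits folds.

    build_and_train(train_texts, train_labels) must return a fitted model
    exposing a predict() method (placeholder for any benchmark model).
    """
    texts, labels = np.array(texts), np.array(labels)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    fold_scores = []
    for train_idx, val_idx in skf.split(texts, labels):
        model = build_and_train(texts[train_idx], labels[train_idx])
        predictions = model.predict(texts[val_idx])
        fold_scores.append(f1_score(labels[val_idx], predictions, average="macro"))
    return float(np.mean(fold_scores))
```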

Preprocessing

For traditional and DL models, a standard preprocessing pipeline was applied to put the text into a clean and uniform format suitable for subsequent analysis. It included lowercasing, diacritic and special character removal, stopword filtering, tokenization, and lemmatization using SpaCy. These steps help reduce sparsity and standardize the text input, especially for models relying on bag-of-words or n-gram features49,50. In contrast, for transformer-based models, we avoided external preprocessing (except lowercasing for compatibility with uncased models), as these models include subword tokenization mechanisms (e.g., WordPiece or SentencePiece) that are tightly coupled with their pretrained vocabularies51. Altering the input text structure can negatively affect transformer performance due to misalignment with the pretraining distributions. Each step of the preprocessing pipeline is described in detail below, followed by an illustrative code sketch:

  • Lowercasing: Texts for all languages were lowercased to ensure uniformity and reduce the complexity arising from case differences.

  • Accent and Diacritic Removal: The unicodedata library was utilized to strip accents and other diacritics from characters. This step handled the various forms of accented characters commonly found, especially in Spanish text.

  • Special Character Removal: Special characters, punctuation marks, and numbers were removed, retaining only alphabetic characters and spaces. This was performed using regular expressions to ensure the text is free from extraneous symbols that could interfere with text analysis.

  • Tokenization: We employed the nltk.word_tokenize function to split the normalized text into individual words (tokens). This step was language-specific and ensured that the text was broken down into its constituent parts, facilitating further processing and analysis.

  • Stop Words Removal: Language-specific stopword lists were used to filter out common words that do not contribute significant meaning to the text (e.g., in Spanish, “el”, “y”, “de”). This step reduces noise and focuses on the more informative words within the text.

  • Lemmatization: This step leveraged SpaCy’s language models to perform lemmatization, converting each token to its base or dictionary form. This process accounted for variations in word forms, such as tense, mood, and number, and consolidated them into a single, standardized form, enhancing the consistency and comparability of the text data.
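The sketch below strings these steps together for Spanish; it is illustrative rather than the exact code used, and assumes NLTK’s punkt and stopwords resources and SpaCy’s es_core_news_sm model are installed (the German pipeline is analogous).

```python
import re
import unicodedata

import nltk
import spacy
from nltk.corpus import stopwords

nlp = spacy.load("es_core_news_sm", disable=["parser", "ner"])  # keep the lemmatizer only
spanish_stopwords = set(stopwords.words("spanish"))

def preprocess(text: str) -> list[str]:
    text = text.lower()                                           # 1. lowercasing
    text = "".join(c for c in unicodedata.normalize("NFD", text)  # 2. strip accents/diacritics
                   if unicodedata.category(c) != "Mn")
    text = re.sub(r"[^a-z\s]", " ", text)                         # 3. keep letters and spaces only
    tokens = nltk.word_tokenize(text, language="spanish")         # 4. tokenization
    tokens = [t for t in tokens if t not in spanish_stopwords]    # 5. stopword removal
    doc = nlp(" ".join(tokens))                                   # 6. lemmatization with SpaCy
    return [token.lemma_ for token in doc]
```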

Traditional machine learning models

The experiments using traditional ML models encompassed a series of steps. Initially, the input texts underwent the preprocessing described in Sect. “Preprocessing” to enhance their suitability for classification. Various n-grams were then extracted from the preprocessed texts and transformed into TF-IDF features using TfidfVectorizer, representing the text data in a numerical format suitable for modelling. The models were evaluated with unigrams, bigrams, and trigrams individually as well as in combination.

Benchmarking a newly developed dataset with straightforward methods is an essential step in the research process. It establishes consistent reference points, enabling reliable evaluation across various methods or algorithms, and ensures that the comparisons made are both fair and meaningful52. Therefore, a collection of ML models, including Logistic Regression, SVM with linear and RBF kernels, Multinomial Naive Bayes, Decision Tree, and Random Forest (all with default parameters), serve as traditional ML baselines.
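A sketch of these baselines with scikit-learn is given below; dataset loading is omitted, and texts and labels stand for the preprocessed tweets and their gold labels.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

NGRAM_RANGES = {"Uni": (1, 1), "Bi": (2, 2), "Tri": (3, 3), "Uni+Bi": (1, 2)}
BASELINES = {
    "LR": LogisticRegression(),
    "SVM (Linear)": SVC(kernel="linear"),
    "SVM (RBF)": SVC(kernel="rbf"),
    "MNB": MultinomialNB(),
    "DT": DecisionTreeClassifier(),
    "RF": RandomForestClassifier(),
}

def run_baselines(texts, labels):
    """5-fold cross-validated macro F1 for each n-gram setting and classifier."""
    for ngram_name, ngram_range in NGRAM_RANGES.items():
        for model_name, classifier in BASELINES.items():
            pipeline = Pipeline([
                ("tfidf", TfidfVectorizer(ngram_range=ngram_range)),
                ("clf", classifier),
            ])
            scores = cross_val_score(pipeline, texts, labels, cv=5, scoring="f1_macro")
            print(f"{ngram_name:7s} {model_name:12s} macro F1 = {scores.mean():.4f}")
```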

Deep learning models

DL experiments were conducted utilizing two neural network-based models and fastText embeddings for all languages involved in the study. fastText embeddings are publicly available in numerous languages, including those used in our research, making them an ideal choice. Further, fastText embeddings were chosen for this study because they provide word representations that capture both syntactic and semantic information, and they are trained on large corpora covering various languages53,54.

Recent sentiment analysis and text classification studies have shown that combining static and contextual embeddings can enhance classification accuracy55, which supports our choice of integrating fastText embeddings in DL experiments.

This comprehensive coverage ensures that the embeddings are robust and applicable across different linguistic contexts, enhancing the performance of our neural network models.
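For illustration, an embedding matrix for the dataset vocabulary can be built from the publicly released fastText vectors as sketched below; the file name and the vocabulary mapping are placeholders.

```python
import numpy as np
from gensim.models import KeyedVectors

def build_embedding_matrix(vocab, vec_path="cc.es.300.vec", dim=300):
    """vocab maps token -> integer index; index 0 is reserved for padding."""
    vectors = KeyedVectors.load_word2vec_format(vec_path)  # fastText .vec (text) format
    matrix = np.zeros((len(vocab) + 1, dim), dtype="float32")
    for token, idx in vocab.items():
        if token in vectors:
            matrix[idx] = vectors[token]                         # pre-trained vector
        else:
            matrix[idx] = np.random.normal(scale=0.1, size=dim)  # random init for OOV tokens
    return matrix
```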

The descriptions of the neural network-based models used in this study are presented as follows.

Table 7 Summary of layers in the CNN model.

Convolutional neural network

A Convolutional Neural Network (CNN) is well-suited for capturing spatial hierarchies in data. CNNs are particularly effective in processing grid-like data structures, and when applied to text sequences they extract significant features through convolutional operations56. This capability makes CNNs advantageous for tasks such as text classification and natural language processing.

The structure of the CNN model used in this experiment starts with an embedding layer that transforms input tokens into dense vector representations using pre-trained fastText embeddings. This embedding layer is initialized with a matrix of fastText 300D embeddings. Following the embedding layer, the model employs several convolutional layers to extract features from the text. Each convolutional layer uses multiple filters (kernels) of varying sizes to capture different n-gram features. Specifically, the model applies filters with sizes [1, 2, 3, 5], corresponding to unigrams, bigrams, trigrams, and 5-grams. Each filter size has 36 filters, which generate feature maps for different patterns in the text. To prevent overfitting and enhance model generalization, a dropout layer is included with a dropout rate of 0.1. This means that 10% of the neurons are randomly set to zero during training, which helps to reduce the model’s dependency on specific features and improves its robustness.

The output from the convolutional layers is then passed through a fully connected layer. The input size to this layer is determined by the product of the number of filter sizes and the number of filters, representing the concatenated feature maps from all convolutional layers. The fully connected layer maps these features to the output classes, which were 2 for binary and 4 for multiclass tasks.

In the forward pass, the process begins with converting input token indices into dense vectors via the embedding layer. The embedded input is then reshaped to add a channel dimension, making it compatible with the convolutional layers. Convolutional operations followed by ReLU activation extract and enhance feature maps, which are then pooled using max pooling to reduce dimensionality. The pooled features from all convolutional layers are concatenated into a single vector, which is subsequently passed through a dropout layer to avoid overfitting. Eventually, the feature vector is fed into the fully connected layer to generate logits, representing the class scores for classification. A summary of layers for the CNN model is shown in Table 7.
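A PyTorch sketch of this architecture (frozen fastText embeddings, filter sizes [1, 2, 3, 5] with 36 filters each, dropout 0.1, and a final fully connected layer) is given below; it mirrors the description above but is not the authors’ exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, embedding_matrix, num_classes,
                 filter_sizes=(1, 2, 3, 5), n_filters=36, dropout=0.1):
        super().__init__()
        weights = torch.tensor(embedding_matrix, dtype=torch.float32)
        self.embedding = nn.Embedding.from_pretrained(weights, freeze=True)
        embed_dim = weights.size(1)
        # one 2D convolution per filter size, spanning the full embedding dimension
        self.convs = nn.ModuleList(
            nn.Conv2d(1, n_filters, (size, embed_dim)) for size in filter_sizes
        )
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(len(filter_sizes) * n_filters, num_classes)

    def forward(self, token_ids):                          # (batch, seq_len)
        x = self.embedding(token_ids).unsqueeze(1)         # add channel dim: (batch, 1, seq_len, dim)
        feature_maps = [F.relu(conv(x)).squeeze(3) for conv in self.convs]
        pooled = [F.max_pool1d(fm, fm.size(2)).squeeze(2) for fm in feature_maps]
        features = self.dropout(torch.cat(pooled, dim=1))  # concatenated feature vector
        return self.fc(features)                           # class logits
```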

Bidirectional long short-term memory

The other NN-based model employed in this study was a Recurrent Neural Network (RNN), specifically a Bidirectional Long Short-Term Memory (BiLSTM) network. RNNs are designed to handle sequential data by maintaining a form of memory that captures information about previous elements in the sequence. LSTMs, a type of RNN, address the vanishing gradient problem, allowing the network to learn long-term dependencies in the data57,58. This characteristic is essential for understanding context and sequence in natural language tasks.

The architecture of the BiLSTM model in this study begins with an embedding layer that converts input words, represented as indices, into dense vectors of fixed size. These embeddings, which are pre-trained and remain static during training, provide rich semantic information that enhances the model’s ability to understand and process text. By using fastText 300D embeddings, the model captures word meanings based on their context in large corpora, which is especially beneficial when working with limited task-specific data.

At the heart of the model is the BiLSTM layer, which processes the sequence of word embeddings to capture contextual information from both past and future directions within the text. This bidirectional approach allows the model to understand dependencies between words, regardless of their position in the sequence. The LSTM’s hidden layer has 64 units, and the bidirectional nature effectively doubles the hidden state size. The processed sequence is then subjected to two pooling operations, average pooling and max pooling, which summarize the sequence information into fixed-size vectors. These pooled vectors are concatenated to form a single, richer representation of the text.

Following the concatenation, the model applies a linear transformation to reduce the dimensionality of the concatenated vector, which has been designed to accommodate the input from both pooling strategies. A ReLU activation function introduces non-linearity, helping the model to capture complex patterns in the data. To prevent overfitting, dropout is applied with a rate of 0.1, randomly setting 10% of the input units to zero during training. This regularization technique ensures that the model learns more robust features that do not rely too heavily on specific neurons. The final stage of the model involves an output layer that maps the features extracted by the previous layers to the target classes. This layer produces logits, which are unnormalized scores for each class. During training or inference, these logits can be passed through a softmax function to obtain class probabilities. The forward pass through the network is sequential, with each layer’s output serving as the input to the next, culminating in the final classification logits. A summary of layers used in the proposed BiLSTM model is presented in Table 8.
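A corresponding PyTorch sketch of the BiLSTM model is shown below (frozen fastText embeddings, 64 hidden units per direction, average and max pooling, dropout 0.1); the size of the intermediate linear layer is an assumption, since the text does not specify it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiLSTMClassifier(nn.Module):
    def __init__(self, embedding_matrix, num_classes,
                 hidden_size=64, reduced_dim=64, dropout=0.1):   # reduced_dim is assumed
        super().__init__()
        weights = torch.tensor(embedding_matrix, dtype=torch.float32)
        self.embedding = nn.Embedding.from_pretrained(weights, freeze=True)
        self.lstm = nn.LSTM(weights.size(1), hidden_size,
                            batch_first=True, bidirectional=True)
        # avg + max pooling of the bidirectional outputs -> 4 * hidden_size features
        self.reduce = nn.Linear(4 * hidden_size, reduced_dim)
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(reduced_dim, num_classes)

    def forward(self, token_ids):                       # (batch, seq_len)
        x = self.embedding(token_ids)                   # (batch, seq_len, embed_dim)
        sequence, _ = self.lstm(x)                      # (batch, seq_len, 2 * hidden_size)
        avg_pool = sequence.mean(dim=1)
        max_pool, _ = sequence.max(dim=1)
        features = torch.cat([avg_pool, max_pool], dim=1)
        hidden = self.dropout(F.relu(self.reduce(features)))
        return self.out(hidden)                         # class logits
```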

Table 8 Summary of BiLSTM Model Structure.

Transformers

The effectiveness of hybrid contextualized models in fine-grained text classification tasks, such as aspect category detection, has been recently demonstrated, reinforcing the role of transformers in different multiclass classification tasks59,60.

The experiments involving transformer-based models were carried out using BERT and RoBERTa models. For each language, a language-specific version of BERT and RoBERTa was utilized. Additionally, the multilingual BERT (mBERT) and XLM-RoBERTa were fine-tuned across all languages. This setup allows us to conduct three main types of comparisons within the transformer experiments. First, we can compare the performance of language-specific models against multilingual ones for each language. Second, we can examine the performance of multilingual models across the three languages. Furthermore, we can assess which language-specific model performed better for its respective language. Beyond these comparisons, this experimental design provides insights into the effectiveness of language-specific versus multilingual models. By analyzing the results, we can identify patterns and performance trends that may inform future model development and fine-tuning strategies. Additionally, these comparisons help us understand the nuances of each language’s impact on model performance, which could be valuable for tasks involving cross-lingual transfer learning and multilingual applications.

The exact transformer models used, along with their results, are presented in Sect. “Transformers”. The transformer-based models from HuggingFace [https://huggingface.co/] were fine-tuned using the SimpleTransformers library [https://simpletransformers.ai/]. SimpleTransformers is designed to streamline the process of model setup, training, and evaluation61,62. For our experiments, fine-tuning was conducted over 15 epochs using 5-fold cross-validation. The models were trained with a maximum sequence length of 100 tokens and a learning rate of 3e-5.
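A sketch of this fine-tuning setup with SimpleTransformers is shown below for a single fold; the data frames are placeholders, and the German BERT checkpoint is shown as one example of the monolingual models used.

```python
from simpletransformers.classification import ClassificationArgs, ClassificationModel

model_args = ClassificationArgs(
    num_train_epochs=15,
    max_seq_length=100,
    learning_rate=3e-5,
    overwrite_output_dir=True,
)

# train_df / eval_df: pandas DataFrames with "text" and "labels" columns for one CV fold
model = ClassificationModel(
    "bert",
    "bert-base-german-dbmdz-uncased",
    num_labels=4,                      # 2 for the binary task
    args=model_args,
)
model.train_model(train_df)
results, model_outputs, wrong_predictions = model.eval_model(eval_df)
```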

Results

This section presents the results and comparisons of the performance of various experiments conducted on the proposed dataset. The average macro F1 score is used as the primary metric for comparing performances. Additionally, a more detailed analysis of the results from the best-performing experiments is provided in Sect. "Findings and analysis".

Traditional machine learning models

The English language results, outlined in Table 9, exhibit consistent trends across various traditional ML models in binary hope speech detection. SVM models with diverse kernels consistently outperformed their counterparts. Notably, the SVM (RBF) model, utilizing solely unigrams, achieved the highest F1 score of 0.8006. Conversely, models relying solely on bigrams or trigrams demonstrated inferior performance, with the optimal bigram configuration attaining an F1 score of 0.7026 with SVM (Linear). Performance notably improved upon combining unigrams with bigrams, yielding an F1 score of 0.7997 with SVM (Linear). Similarly, in the multiclass setting, unigrams exhibited superior performance, with SVM (Linear) achieving an F1 score of 0.5169. In contrast, bigrams and trigrams individually yielded substantially lower scores, with the combination of unigrams and bigrams resulting in an F1 score of 0.4983 with SVM (Linear).

Table 9 Results of traditional machine learning experiments with various n-grams in the English language.

For the Spanish language results, detailed in Table 10, a comparable pattern emerges. Unigrams emerged as the most effective feature, with SVM (RBF) achieving a score of 0.7757. Analogous to the English findings, both bigrams and trigrams exhibited diminished performance, with the best bigram configuration attaining a score of 0.7008 with SVM (Linear). A marginal improvement was observed upon combining unigrams with bigrams, resulting in a score of 0.7779 with SVM (Linear). However, the multiclass hope speech detection task demonstrated a notable decline in performance, with unigrams and SVM (Linear) achieving 0.4896. Bigrams and trigrams similarly showed diminished scores, with the combination of unigrams and bigrams yielding 0.4701 with SVM (Linear).

Table 10 Results of traditional machine learning experiments with various n-grams in the Spanish language.

For the German language, the results depicted in Table 11 show a parallel trend, mirroring those of English and Spanish. In binary hope speech detection, unigrams paired with both SVM (Linear) and SVM (RBF) delivered the highest score of 0.8021. Conversely, bigrams and trigrams displayed inferior performance, with bigrams reaching a maximum of 0.7100 with SVM (Linear). A modest enhancement was observed upon combining Uni+Bi-grams, improving performance to 0.8101 with SVM (Linear). Similarly, in multiclass hope speech detection, unigrams paired with SVM (Linear) proved most effective, achieving a score of 0.4180. Bigrams and trigrams registered lower scores, with the Uni+Bi-gram combination reaching 0.4094 with SVM (Linear).

Table 11 Results of traditional machine learning experiments with various n-grams in the German language.

Deep learning models

Table 12 presents the average macro F1 scores of two DL models, BiLSTM and CNN, for binary and multiclass hope speech detection across English, Spanish, and German. In binary hope speech detection, CNN slightly outperforms BiLSTM in Spanish and German, with average macro F1 scores of 0.7567 and 0.7847, respectively, indicating that it captures relevant features more effectively in these languages. In contrast, BiLSTM performs marginally better in English, with an average macro F1 score of 0.7869.

On the other hand, for multiclass hope speech detection, BiLSTM outperforms CNN in English and German, whereas CNN slightly surpasses BiLSTM in Spanish. The results show that both models perform better in binary classification than in multiclass classification, with the highest scores observed in English, where the BiLSTM model reaches an average macro F1 score of 0.5703.

Comparing their performance with traditional ML models shows a clear improvement in the multiclass setting. In the binary task, however, although DL models surpass traditional ML models overall, the best results are still obtained with traditional ML models and n-gram features. A detailed comparison and analysis are provided in Sect. “Findings and analysis”.

Table 12 Results of deep learning experiments.

Transformers

In this study, the performance of monolingual and multilingual versions of BERT and RoBERTa architectures were evaluated for binary and multiclass hope speech detection on the proposed datasets. Table 13 presents detailed results of these models.

In both binary and multiclass hope speech detection tasks, monolingual models generally outperform multilingual models in their respective languages. For instance, in the binary classification task, monolingual German BERT (bert-base-german-dbmdz-uncased) achieved the highest F1 score of 0.8704, a score matched by the multilingual XLM-RoBERTa model. Similarly, for English and Spanish, monolingual BERT models perform slightly better than their multilingual counterparts.

In most cases, RoBERTa models showed slightly lower performance than BERT models. Notably, the bertin-project/bertin-roberta-base-spanish model showed a considerable drop in Spanish performance in both tasks.

In the multiclass classification task, performance drops across the board compared to the binary classification task, reflecting the increased difficulty of multiclass categorization. The monolingual models again outperform the multilingual ones, but the margin is narrower than in binary classification.

Interestingly, the XLM-RoBERTa model performed relatively well among multilingual models in both tasks, suggesting it might be better suited for cross-lingual tasks compared to mBERT. However, monolingual models still hold the advantage for specific language tasks, likely due to being fine-tuned on language-specific datasets.

These findings highlight that while multilingual models offer versatility across languages, monolingual models provide superior performance when tailored to individual languages. This insight is crucial for selecting appropriate models based on the specific requirements of a given task, especially in multilingual contexts.

Table 13 Results of transformers experiments.

Findings and analysis

Our proposed datasets for Spanish and German not only replicate the structure of PolyHope but expand its linguistic scope, enabling cross-cultural comparisons. This makes MIND-HOPE the first multilingual resource that facilitates fine-grained hope speech analysis beyond English. Compared to English-focused resources, these datasets offer deeper insights into how hope is communicated across languages, revealing important linguistic patterns that monolingual resources cannot capture.

The summary of the top-performing models for each learning approach can be found in Table 14. Across various languages and models, performance varies significantly. For instance, bert-base-german-dbmdz-uncased consistently demonstrates strong results in German, whereas xlm-roberta-base excels in English across both binary and multiclass detection tasks. Transformers generally outshine traditional machine learning and sometimes DL methods, particularly evident in binary hope speech detection tasks where xlm-roberta-base and bert-base-german-dbmdz-uncased consistently achieve high F1 scores across all languages.

Complex models like Transformers (e.g., bert, xlm-roberta) tend to yield superior outcomes compared to simpler models such as SVMs and CNNs, especially in tasks requiring nuanced comprehension like hope speech detection. While Transformers perform well overall, their efficacy can slightly vary depending on language and specific dataset characteristics, as observed across English, Spanish, and German datasets. Therefore, the findings indicate that Transformers, particularly models like bert-base-german-dbmdz-uncased and xlm-roberta-base, show promise for hope speech detection across multiple languages, delivering robust performance across diverse tasks compared to traditional ML and DL methods. A detailed analysis of each learning approach is discussed below:

Table 14 Best performing model based on each learning approach.

  • Traditional ML models:

    • Unigrams’ Superior Performance: Unigrams, alone and combined with other n-grams, consistently yield the highest F1 scores across all models and languages for both binary and multiclass tasks. This indicates that single-word features are highly informative for the classification tasks at hand.

    • Combination of N-grams: Combining unigrams with bigrams (Uni+Bi-grams) generally results in better performance compared to bigrams or trigrams alone. However, the improvement is not always significant and often does not surpass the performance of unigrams alone, especially for binary hope speech detection.

    • Multiclass Setting Challenges: The lower F1 scores observed in the multiclass setting suggest that distinguishing between different types of hope is inherently more challenging for traditional ML models with simple n-grams. This could be due to overlapping features among hope classes and an imbalance in class distribution.

    • Model Performance: SVM models, particularly with linear and RBF kernels, often outperform other models (LR, DT, RF, and MNB) in both Binary and Multiclass settings. This suggests that SVMs are well-suited for the feature space and nature of the dataset used in these experiments.

    • Language Consistency: The trends observed in the performance of different n-grams and models are consistent across English, Spanish, and German datasets. This suggests that the findings may generalize well across different languages with similar text classification tasks.

    • N-gram Limitations: While the combination of n-grams improves results slightly, it is evident that higher-order n-grams (Bigrams and Trigrams) alone do not provide significant additional information compared to Unigrams. This may be due to the sparsity and higher dimensionality associated with higher-order n-grams.

  • Deep learning models:

    • Model Performance Comparison: BiLSTM generally performs slightly better overall across both binary and multiclass hope speech detection tasks, particularly showing strength in English datasets. CNN, however, exhibits competitive performance, often performing better in Spanish and showing comparable results in German.

    • Language-Specific Performance: BiLSTM consistently outperforms CNN in both binary and multiclass tasks in English, suggesting inherent advantages for sequence-based models such as BiLSTM in this language. However, CNN demonstrates varied performance compared to BiLSTM, showing slight superiority in binary tasks for Spanish and marginal improvements in multiclass tasks for German. These findings highlight the nuanced impact of language on model performance.

    • Task Complexity Impact: Binary hope speech detection consistently yields higher accuracies compared to multiclass detection for both BiLSTM and CNN models across all languages. This is in line with the results of other learning approaches, too. This indicates that the simpler binary classification task is generally easier to model accurately than predicting multiple classes.

  • Transformers:

    • Monolingual vs. Multilingual Models: The monolingual models generally outperformed multilingual models in their respective languages, especially in binary classification tasks. However, in more challenging multiclass classification tasks, performance decreases for all models, and while monolingual models still have an edge over multilingual ones, the difference in performance is smaller. The superior performance of monolingual transformers can be attributed to language-specific fine-tuning and pretraining on domain-relevant corpora. These models benefit from richer contextual representations and vocabulary coverage tailored to the target language. In contrast, multilingual models often trade off representation quality for broader cross-lingual coverage, which may reduce effectiveness in single-language tasks63,64.

    • BERT vs. RoBERTa Models: RoBERTa models generally perform slightly worse than BERT models. Specifically, the bertin-project/bertin-roberta-base-spanish model shows a significant decrease in performance when applied to tasks in Spanish, both in binary and multiclass classifications.

    • Multilingual Model Performance: The XLM-RoBERTa model performs well across multilingual tasks, suggesting it may be more effective for cross-lingual applications than mBERT. Nonetheless, monolingual models still excel in tasks specific to individual languages, benefiting from fine-tuning on datasets tailored to those languages.

Class-wise performance analysis

Table 15 provides class-wise precision, recall, and F1-scores for the best-performing models in both Spanish (xlm-roberta-base) and German (bert-base-german-dbmdz-uncased) across binary and multiclass classification tasks. In the binary setting, both languages exhibit strong and balanced performance across the two classes, with German slightly outperforming Spanish.

In the multiclass setting, performance becomes more variable. In both languages, Not Hope and Generalized Hope consistently receive the highest F1-scores (above 0.79 in German and 0.73 in Spanish), indicating that these two categories are more easily distinguishable by the models. This aligns with their clearer semantic boundaries and higher frequency in the dataset. In contrast, the Realistic Hope and Unrealistic Hope classes suffer from lower performance, with German yielding F1-scores of 0.61 and 0.46, and Spanish slightly outperforming German in Unrealistic Hope (0.60 vs. 0.46), while lagging behind in Realistic Hope (0.53 vs. 0.61).

These discrepancies reflect the challenges posed by low-resource classes and semantic ambiguity. For example, Unrealistic Hope is particularly difficult for both models, likely due to its nuanced linguistic expression and limited annotated examples. Interestingly, although German models benefit from higher agreement scores among annotators, Spanish models perform slightly better in capturing subtle expressions of unrealistic hope, suggesting potential differences in how such emotions are expressed across languages or how effectively the model tokenizers capture those nuances.

This fine-grained breakdown confirms that model performance is shaped not only by class imbalance but also by language-specific linguistic features and the granularity of class definitions. These findings reinforce the need for targeted strategies, such as class-aware loss functions, hierarchical modeling, or prompting methods, particularly to improve performance on subtle or underrepresented categories.
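Class-wise scores of this kind can be obtained directly from the fold predictions, for example with scikit-learn’s classification_report; the snippet below is a small illustrative helper.

```python
from sklearn.metrics import classification_report

CLASSES = ["Not Hope", "Generalized Hope", "Realistic Hope", "Unrealistic Hope"]

def class_wise_scores(y_true, y_pred):
    """Per-class precision, recall, and F1, as reported in Table 15."""
    return classification_report(y_true, y_pred, labels=CLASSES, digits=2)
```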

Table 15 Class-wise scores using best-performing models for Spanish (xlm-roberta-base) and German (bert-base-german-dbmdz-uncased).

Limitations

Although our primary goal was to establish and evaluate the Spanish and German hope speech datasets, with the aim of providing a foundation for future multilingual and fine-grained emotion classification tasks, there are certain methodological aspects and enhancements that were not addressed in the current work to maintain the focus and scope of the study. These limitations offer potential avenues for future work:

  • Label imbalance in the datasets: One notable limitation of this study is the imbalance in label distribution across the datasets, as shown in Figure 1. In particular, the Realistic and Unrealistic Hope categories are underrepresented compared to Generalized Hope and Not Hope. While we used stratified 5-fold cross-validation to ensure representative sampling during training and evaluation, we did not apply specific strategies such as oversampling, undersampling, or class weighting to counteract this imbalance. This decision was made to maintain comparability with prior work. However, we acknowledge that such imbalance may affect classifier performance, particularly for underrepresented classes, and may contribute to lower macro F1 scores in multiclass settings. In future work, we plan to explore methods such as Synthetic Minority Over-sampling Technique (SMOTE), focal loss, and class-specific weighting in model training. These adjustments may help models better recognize rare classes and improve fairness and overall robustness in hope speech detection.

  • Large language models and prompt-based methods: Another limitation of this study is the exclusion of large language models (LLMs) such as GPT-4 and ChatGPT, which have shown impressive results in zero-shot learning (ZSL) and few-shot learning (FSL) settings. These models operate fundamentally differently from the supervised transformer-based models evaluated in this paper and require a separate evaluation setup. Due to space constraints and the focus on benchmarking the newly developed MIND-HOPE datasets with reproducible supervised techniques, we chose not to include LLM-based experiments in this study. However, we recognize the promise of prompt engineering and ZSL/FSL with LLMs and plan to investigate these approaches in future work to further improve hope speech detection, especially in cross-lingual and low-resource scenarios.

  • Annotation ambiguity and category overlap: While the annotation process achieved substantial agreement (Cohen’s K = 0.78 for Spanish and 0.85 for German in binary; Fleiss’ K = 0.71 for Spanish and 0.81 for German in multiclass), the relatively lower agreement in the multiclass setting reflects the inherent difficulty of distinguishing between nuanced categories like Realistic Hope and Unrealistic Hope. Interestingly, the best-performing monolingual transformer model also showed comparatively lower F1 scores for these underrepresented and conceptually adjacent classes. This alignment between human and model confusion suggests that some category boundaries may be ambiguous even for expert annotators. Future work may explore revising or merging categories, providing more concrete annotation examples, or applying probabilistic or fuzzy labeling to better capture annotator uncertainty.

Conclusion and future work

This study contributed significantly to the field of hope speech detection through several key advancements. The development of the first multiclass hope speech detection datasets for Spanish and German represents a notable advancement, addressing a critical gap in the literature. These datasets facilitate the analysis and modeling of hope speech in languages beyond English, thereby supporting more comprehensive research and practical applications in diverse linguistic contexts.

A comprehensive review of existing datasets and techniques was provided, which could be a valuable resource for understanding the current state of hope speech detection. This review highlighted both recent advancements and existing limitations, offering insights into the evolution of methodologies and identifying areas for further research. By synthesizing current knowledge, the review contributed to a clearer understanding of effective techniques and underscored the need for continued innovation.

Additionally, we presented comparative analyses of various learning approaches, including traditional ML, DL, and Transformer models, which revealed important findings about their relative efficacy. Transformers generally outperformed traditional ML and DL methods, especially in binary classification tasks, indicating their robustness in capturing nuanced patterns of hope speech. They also demonstrated clear advantages in multiclass classification scenarios, providing valuable guidance for selecting appropriate methodologies based on specific task requirements.

Furthermore, the evaluation of multilingual transformers against monolingual models further highlighted the strengths and limitations of different approaches. While multilingual models such as XLM-RoBERTa show promise in cross-lingual tasks, monolingual models typically achieve superior performance for language-specific tasks. This insight is crucial for selecting models tailored to specific languages, emphasizing the need for both multilingual versatility and monolingual precision.

In future work, we plan to organize a shared task based on the MIND-HOPE datasets to foster broader participation and standardized evaluation. Additionally, we will explore the use of large language models (LLMs) such as GPT-4 and multilingual instruction-tuned models for hope speech detection in ZSL and FSL settings. This includes designing and evaluating handcrafted and template-based prompts, comparing LLM outputs to supervised baselines, and assessing generalization in low-resource and cross-lingual scenarios. These steps aim to extend the utility of MIND-HOPE into the space of prompt-based learning and instruction-following models.