Introduction

Over the past three years, the Corona Virus Pandemic has swept across nations, leaving in its wake substantial damage to both public health and collective sentiment. The World Health Organization (WHO) deemed this crisis as one of the most severe public health emergencies in recent human history, a distressing indicator of its unprecedented scale.

In response to this pandemic, individuals worldwide turned to various social media platforms including Twitter, Facebook, and Quora to share their personal experiences, opinions, and emotions. These digital footprints etched across online spaces offer poignant snapshots of public life during the pandemic, carrying immense value not only for understanding the immediate crisis but also for fortifying our preparedness against potential future emergencies.

Numerous studies have delved into the realm of social media analysis amidst epidemic crises, exploring diverse facets such as the prevalence of negative emotions (Boon-Itt et al., 2020), the shaping of country image (Chen et al., 2020), manifestations of racism (He et al., 2021), expressions of nationalism (Bieber, 2022; Elias et al., 2021), propagation of hate speech (Wu et al., 2022), and more. These investigations have provided invaluable insights into the evolving social landscape during distinct time periods.

Furthermore, an array of methodologies has been employed to dissect the vast expanse of big data, encompassing approaches such as a mix of Machine Learning techniques (Garcia, 2020), and semantic network analysis (Ho, 2022), among others. While these diverse methods have significantly contributed to the field, it is noteworthy that the majority of them have concentrated either on the refinement of methodological approaches (Dey et al., 2001), (Rosa et al., 2015), (Yu et al., 2017), (da Rosa et al., 2016), or the application of pre-existing data mining techniques (Naseem et al., 2021). These advanced methodologies have paved the way for intelligent data processing within the realm of social media and have found extensive utility within the domain of computational communication science.

However, there are two notable challenges remain in comprehending the dynamic evolution of online discourse and the ever-shifting contours of public sentiment of the pandemic. Firstly, as the pandemic perpetually mutates and maintains a continuous grip on global affairs, the discussions unfolding in the digital sphere are characterized by remarkable swiftness and changeability. The inherent dynamism of these discussions warrants thorough investigation, as the temporal factors and the driving force behind this present study should be considered. Secondly, a persistent challenge remains in effectively harmonizing computational and quantitative methods with qualitative approaches (Zhao, 2023). The former specializes in unearthing patterns and laws within the data but often falls short when it comes to interpreting results and constructing meaningful narratives. Conversely, the latter excels in theoretical explanation but heavily relies on empirical observational data. These two methodological paradigms possess complementary strengths, yet they are seldom seamlessly integrated within a single research endeavor.

This paper addresses the aforementioned challenges and makes contributions in two areas:

(1) A Theoretical Foundation for deciphering nationalist rhetoric and collective perceptions during the pandemic by an evolving discourse perspective. The frame theory of communication is incorporated to shape quantization results from three critical category: feeling, action, and identity. Results illustrate that as time changes from 2020 to 2022, the feeling category increases and the action decreases, while the identity category remains consistent.

(2) A Multifaceted Approach combining advanced quantitative techniques and theoretical qualitative analysis are developed. The quantitative techniques employ transformer-based sentiment/emotion analysis and Bert-based topic modeling to generate quantitative evidence, enriching the depth and breadth of the text analysis. In the quantitative aspect, the evolving discourse framework further examine and interpret the research findings, enhancing the understanding of the opinion dynamics of pandemic at play.

The structure of this paper is as follows: In “Related Work”, we present a comprehensive overview of related work. “Datasets and Methods” details the datasets and methods employed in our study. The quantative mining analysis results are expounded upon in “Quantitative Mining Results”. “Evolving Discourse Frame” delves into the intricate components of the evolving discourse framework. Finally, in “Discussion and Conclusion”, we draw our conclusions based on the insights gleaned from this study.

Related work

The internet influence on nationalism

Nations, despite their complexities, persist as potential sources of violent exclusion, resisting facile categorization into simplistic binaries of anti-national and pro-national stances (Machin, 2014). The elusive nature of nations defies easy definition, and it is precisely this indeterminacy that underscores their enduring significance in contemporary social discourse.

Scholarly discussions on nationalism have often intertwined with topics like racism, patriotism, or populism, leading to theoretical ambiguity and conceptual debates. Although these concepts are often perplexed, they have slit differences in notions. As one of the principal ideologies in the 19th century, racism is focused on the human entity itself, while nationalism loosely constructed faith that aligns with liberalism, conservatism, and socialism (Mosse, 1995). Either as a form of discrimination or a way of looking at men and women, which presented a total picture of the world, racism was never an indispensable element of nationalism, especially during movements. Patriotism is given symbols of national loyalty and implies indifference or hostility to the people of other nations. Critical notions focus on the nation-state, despite patriotism is much more than simple loyalty to a country or obedience to its leaders (Audi, 2009). Populism and nationalism both emphasize “the people” despite that “the people” in populism refers to ethnos rather than demos (Akkerman, 2003).

The nation-state remains the dominant context for political presentation, as nationalism often emerges as a consequence of state-building processes (Tilly, 1994). While Gellner (2015) underscore the connection between nationalism and the nation-state by arguing that the nation-state becomes the political form that best accommodates the requirements of an industrial society, providing a framework for the management of diverse populations through the imposition of a standardized national culture.

Classical nationalist theories highlight the role of elites in propagating a shared imagined community from the core to the periphery (Whitmeyer, 2002). Recent inquiries have begun to focus on the network attributes of nationalism, suggesting that order can emerge from chaos, and “national” behaviors may manifest without a conventional imagined community. By exploring how symbols, rituals, and narratives contribute to national identity, Calhoun (2002) regard nationalism involves the production and dissemination of cultural meanings, binding people within an imagined community.

Nationalism is defined by Anderson as an ideology that pertinent to public conception of community (Anderson, 2006). His idea of the nation is identified as an “imagined community,” although it was concerned in “an anthropological spirit”. This conception is constructed by the distinction between one nation and other nations. Thereby nationalism discourse is structured as in-and-out-relation.

With the advancement of technology, the “imagined nationalism” accessible through social media platforms can significantly influence or reshape individual sense of identity. The internet has thus become an informal public sphere where individuals can reconstruct a sense of community. Although concept of the “global village” developed by McLuhan is frequently invoked to argue that the internet promotes more global than national communities, research has increasingly shown that the internet acts more as a “re-embedding technology” than a “dis-embedding technology” (Eriksen, 2007). It can be employed to reinforce identities that have been marginalized or altered (Eriksen and Kress, 2006) or to disintegrate identity.

In the political sphere, nationalism typically perceives the world as structured into nations (Bieber, 2022), representing not only a political ideology but also a vital source of meaning and identity. The rise of internet-driven nationalism is often considered a driving force that has pushed society into a constant state of division (Ausserladscheider, 2019). This term refers to the phenomenon where the internet plays a pivotal role in fostering and amplifying nationalist sentiments, contributing to societal division. It acknowledges the transformative impact of online platforms in shaping and intensifying nationalist ideologies, leading to a persistent state of societal polarization and discord (Cowan, 2021; Hyun et al., 2014). As the Corona Virus Pandemic amplified ethnic and nationalist tensions, nationalist and nativist sentiments have surged alongside the ongoing global pandemic (Bieber, 2022; Elias et al., 2021). Research has also pointed out that many tweets during the pandemic reported the dissemination of racist, prejudiced, and xenophobic attacks against East Asians (Abd-Alrazaq et al., 2020).

In the modern media landscape, there is a discernible shift toward the era of nationalism. A significant characteristic of this transition lies in the increasing deterritorialization of power, culture, politics, and economies, while the territorial nature remains integral to nations. This transformation, driven by the widespread adoption of communication technologies, has presented new challenges to established nation-states by creating cross-over communities on one hand and constraining individual identities on the other. Consequently, integrating computational techniques to understand the evolving landscape of nationalism in the era of new media has become a challenging yet invaluable endeavor.

In this direction, this paper frames the online nationalism discourse during the pandemic based on the above theoretical basis. Examination of the nationalist discourse framework elucidates alterations in the constituents of action, emotion, and identity. Specifically, there is a discernible decline in the identity framework, juxtaposed with a concurrent ascent in the emotional framework.

Discourse and sentiment analysis

The term “discourse” is employed in diverse ways, either as a means for providing a language to facilitate communication or as a tool to define and construct “the objects of knowledge” (Jäger, 2001). Given the substantial influence of online public opinion, discourses not only shape individual attitudes towards oneself and others but also play a role in reshaping national identity. Within the expansive landscape of social media, Twitter house a wide array of individuals and organizations, contributing to opinion fragmentation and rapidly facilitating conditions for public debates. This establishes a discursive foundation for comprehending the divergence in social attitudes. Consequently, the participants in this study demonstrate heightened awareness towards “online discourses” as vehicles for the expression of opinions. The emergence of the Internet has created a platform for democratic interaction on various online subjects. Its inherent openness lays the foundation for democratic discourse, encouraging a more extensive and profound engagement of individuals in discussions related to public affairs (Witschge, 2008). These online discourses encompass the exchange of information, opinions, and ideas within the virtual realm, revealing facets intricately interwoven with power dynamics and cultural influences during the pandemic.

Twitter, in particular, serves as an informative source for gaining insights into how people respond to specific events and topics (Ahmed et al., 2019). During the past pandemic of corona virus, it served as a platform for “speaking out” or, at times, as a tool for generating social discord. It is worth noting that online platforms are also influenced by various power dynamics that can shape the course and outcomes of discussions. Participants’ actions on Twitter can be harnessed by other entities, especially media outlets or political figures with publicity-oriented agendas or propaganda intentions. Notably, L1ght, a company specializing in measuring online toxicity, reported a staggering 900% increase in hate speech towards China on Twitter in 2020. These records indicate palpable tensions linked to the pandemic, while the evolution of the discourse process remains fluid.

To quickly gain insight into public discussions, sentiment analysis is employed to analyze users’ opinions, attitudes, and behavioral responses to specific events or incidents based on available written language (Jana et al., 2020). Sentiment analysis in the context of COVID-19 has primarily been employed to guide health system intervention plans by monitoring and analyzing public emotions and areas of concern during the pandemic. It treats the entire body of text related to a topic as a fundamental unit of information, leveraging machine learning and other techniques to identify the emotions expressed by individuals (Tago et al., 2022). Several studies have applied sentiment analysis and topic modeling methods to analyze tweets related to the pandemic, thereby exploring public discourse and psychological reactions. For instance, Boon-Itt et al. (2020) conducted thematic and sentiment analysis of 107,990 pandemic-related tweets between December 13 and March 9, 2020, shedding light on public perceptions and emotions during the early stages of the outbreak. They identified the most significant keyword as “outbreak,” with predominantly negative sentiment surrounding the discovery of the coronavirus.

The prevailing consensus in most studies is that negative emotions dominated the initial stages of the pandemic, although public sentiments evolved over time (Tsao et al., 2021). Alanezi and Hewahi (2020) used K-means clustering to analyze the emotional patterns of tweets during the pandemic and explore the impact of social distancing measures. Naseem et al. (2021) categorized pandemic-related tweets into 12 themes and employed natural language processing (NLP) techniques to automatically identify data as positive, negative, or neutral. Garcia et al. (2021) examined COVID-19-related tweets in English and Portuguese from April 17, 2020, through August 08, 2020. They identified ten main pandemic-related topics in both languages, most of which conveyed negative sentiments. Despite a wealth of literature offering insights into public discourse during the pandemic, there remains a scarcity of studies that investigate this discourse from an evolving perspective.

To address the above challenge, this paper compiles and compares Twitter data across three years of the pandemic. Utilizing a mixed methodology that incorporates quantitative analysis and theoretical qualitative analysis, this study develops an evolving discourse framework to provide a comprehensive theoretical interpretation of online nationalistic discourses during the pandemic. By conducting a diachronic analysis of both network structure and key discourse networks, this study elucidates the transformations within the nationalist discourse framework amid the epidemic. It furnishes empirical substantiation for elements such as the emotional framework and action framework. Concurrently, this research contributes to the enhancement of computational methods applicable in theoretical domains.

Datasets and methods

Datasets

China-oriented datasets

The datasets contain three groups of tweets. The time period spans three years: (1) Mar. 23, 2020 to Jul. 03, 2020, a total of 1,042,627 tweets, as the period selection is based on a sudden raise of global hate speech since Mar. 2020, and a temporary respite at Jul., 2020 due to the global focus change towards the wide-spread virus dynamics, (2) Mar. 18, 2021 to Apr. 15, 2021, a total of 764,030 tweets, as the period selection is based on the similar reason to 2020, and (3) Mar. 19, 2022 to Jul. 19, 2022, a total of 1,077,888 tweets, as the time period is to align with the year 2020. A collection of keywords belonging to the combination of two sets: (a) China-related keywords {‘china’, ‘chinese’, ‘P.R.China’} and (b) COVID-19 keywords {‘covid-19’, ‘covid’, ‘covid19’, ‘coronavirus’, ‘corona’, ‘sars cov 2’, ‘ncov 2019’}. The detailed tweet fields including {‘Date(GMT)’, ‘UserId’, ‘UserScreenName’, ‘UserName’, ‘Text’, ‘Platform’, ‘Type’, ‘RetweetCount’, ‘FavoriteCount’, ‘URL’} are collected through Twitter API. After removing duplicates and incomplete items, 2,884,548 tweets are finally obtained. Three are 1,039,566 tweets for 2020, 763,930 for 2021, and 1,048,556 for 2022.

World Health Organization (WHO) official datasets

To explore the intricate relationship between infectious disease patterns and their corresponding online sentiments over time, this study compiles and utilizes official data from the World Health Organization (WHO). The dataset spans from January 2020 to September 2022, encompassing critical information on confirmed cases, deaths, and recoveries related to the pandemic. These datasets provide comprehensive records of the global pandemic evolution, detailing its impact on various countries and regions worldwide.

Methods

This paper employs two text mining techniques, namely sentiment analysis and topic modeling, to gauge the prevailing public sentiment and identify the predominant topics within tweet texts.

Sentiment and emotion analysis

To conduct sentiment and emotion analysis, we have employed the finely-tuned transformer model developed by CardiffNLP. CardiffNLP is a research group in Natural Language Processing at Cardiff University, who work on the theoretical and applied NLP such as lexical semantics, multilinguality, downstream and social NLP applications (Camacho-Collados et al., 2022). The finely-tuned transformer model developed by them has undergone pre-training using a diverse range of datasets from multiple sources (Barbieri et al., 2020, 2022; Rosenthal et al., 2019; Mohammad et al., 2018) and has been rigorously validated, proving its competitiveness and advanced capabilities in handling Twitter-specific corpora. The specific analytical steps in this paper are detailed below.

Step 1. Corpora preparation. In the initial phase, we meticulously prepare our corpora. This involves extracting the ‘text’ field from tweets, which serves as our primary corpus. The transformer model we utilize possesses the capacity to encode both word embeddings and position embeddings, thereby allowing us to retain most characters within the text, including URLs, emoticons, stop words, and more.

Step 2. Sentiment recognition. Moving forward, we delve into sentiment recognition. Here, our objective is to predict the sentiment of a given Twitter text and classify it into one of three labels: positive, neutral, or negative. For English texts, we leverage a model trained on the Semeval-2017 dataset (Rosenthal et al., 2019). Meanwhile, texts in languages other than English are analyzed using a model trained on the UMSAB datasets (Barbieri et al., 2022). Each sentiment label is associated with a score that falls within the [0, 1] range, reflecting its probability.

Step 3. Emotion recognition. The emotion recognition task involves classifying Twitter text into one of four emotional categories: anger, joy, sadness, or optimism. The emotional categories are originally provided by TweetEval, a well-regarded and unified RoBERTa-base architecture. To accomplish this, a model trained on the dataset from SemEval-2018, specifically Task 1: Affect in Tweets (Mohammad et al., 2018) is employed.

Step 4. Descriptive statistics. Finally, in this step, the outcomes of both sentiment and emotion analysis conducted in Step 2 and Step 3 are draw upon. The sentiment and emotion categories with the highest ratings are selected as the primary labels for each tweet. Then, the average scores are calculated to gain insights into the degree of dominance exhibited by these primary annotations.

Topic modeling

Bertopic, an advanced topic modeling technique, has garnered recognition for its robust performance across various benchmark datasets. This method leverages pre-trained transformer-based language models and a class-based variation of TF-IDF, as introduced by Grootendorst in 2022. Its particular strength lies in its ability to effectively analyze short text data. In this paper, Bertopic is employed to unearth latent topics within the Twitter dataset, following these key steps:

Step 1. Corpora preparation. This step involves extracting text from Twitter and meticulously removing stop words to enhance the quality of the dataset. The list of stop words is sourced from a repository on GitHub, made available by University College Dublin.

Step 2. Document embeddings. Bertopic harnesses the sentence-BERT (SBERT) framework to generate dense sentence embeddings using pre-trained language models. These embeddings serve as the foundation for clustering documents with similar semantic content. A hierarchical extension of DBSCAN known as HDBSCAN is employed to facilitate this clustering process. This model treats noise as outliers, enabling soft-clustering and thus preventing the grouping of unrelated topics.

Step 3. Topic representation. Each cluster that is generated is associated with a distinct topic. To gain insight into the composition of these topics, Bertopic employs a modified TF-IDF approach known as class-based TF-IDF. This formula is utilized to represent the significance of each word within a document and is expressed as follows:

$${W}_{t,c}=t{f}_{t,c}\,\log \left(1+\frac{A}{t{f}_{t}}\right)$$

where \({W}_{t,c}\) represent the weight of a term t in a class c. The class c is defined as a single document that containing all documents belonging to the corresponding cluster. A denotes the average number of words per cluster. \(t{f}_{t}\) is the frequency of term t across all clusters. \(t{f}_{t,c}\) represents the frequency of term t across the clusters c. The class-based TF-IDF measures the importance of words in a cluster and generates the topic-word distribution.

Quantitative mining results

Evolutionary trends

A noteworthy trend characterized by an escalating death rate and an increasing volume of tweets is shown in Fig. 1. Within this trend, seasonal variations are observed in the growth rates of cases. However, it is important to note that the number of new deaths does not exhibit a discernible seasonal pattern. Remarkably, when compared to the curve depicting new confirmed cases, the surge in tweet activity demonstrates more synchronized fluctuations with the rate of death increase.

Fig. 1
figure 1

Death toll growth versus social media mood swings.

Sentiment distribution

Text sentiment informs the implicit attitudes, standpoints, psychological motivations, or behavioral logic of posters. The sentiment distribution of all tweets are shown in Table 1. Results reveal that negative sentiment remained relatively consistent in 2020 and 2021 but exhibited a significant surge in 2022, reaching 43%. In contrast, the positive sentiment experienced a consistent uptrend, increasing from 2.9% in 2020 and 5.0% in 2021 to 11.1% in 2022. This shift suggests that, as the proportion of neutral sentiments declines, the proportion of polarized sentiments (positive and negative) is on the rise, with negative sentiments gaining prominence.

Table 1 The sentiment distribution of total tweets.

The minority of tweets in cyberspace often receive a significant amount of attention and are referred to as influential tweets. The sentiment distribution of these tweets are illustrated in Table 2. Influence is determined by tweets with the highest number of retweets (n = 1000) and favorites (n = 1000). The results indicate a significant decrease in neutral sentiment from 2020 to 2022, mirroring the trend observed in the overall sample. In the case of negative sentiment, there is a notable upward trajectory, rising from 26.0% to 44.1% for top retweets and from 28.7% to 56.2% for top favorites. These trends are steeper than the overall trend, which sees a shift from 30.6% to 43.0%. This suggests that highly influential tweets tend to exhibit more pessimistic sentiments than ordinary tweets.

Table 2 The sentiment distribution of tweets with top retweets and favorites (n = 1000).

Conversely, for positive sentiment, the trends are as follows: 4.7%, 11.2%, and 10.7% for top retweets, and 5.8%, 13.8%, and 12.9% for top favorites. These trends show a sudden increase in 2021, which is sustained in 2022. In contrast, the overall sample exhibits a gradual increase trend from 2.9% to 11.1%. This indicates that highly influential tweets tend to display positive trends earlier than ordinary tweets.

Emotion distribution

The emotions distribution of tweets from 2020 to 2022 is shown in Table 3. The results indicate a progressive decrease in the percentage of anger. However, sadness exhibits a sudden increase in 2022. Both anger and sadness are generally associated with pessimistic sentiments, and their overall percentages remain relatively stable between 2020 and 2022. In contrast, joy and optimism experience a peak in 2021 followed by a more substantial decline in 2022, suggesting that these two positive emotions are not as prominent in the overall emotional landscape.

Table 3 The emotion distribution of total tweets.

The emotional distribution of tweets with the highest retweet-count (n = 1000) and favorite-count (n = 1000) is shown in Table 4. The results reveal that for the top retweeted tweets, the combination of anger and sadness follows an evolving trend, decreasing from 57.0% to 45.3%, and then increasing to 70.7% from 2020 to 2022. Similarly, for top favorited tweets, this trend shifts from 63.7% to 44.3%, and then rises to 71.3%. Comparatively, the overall sample of tweets exhibits a somewhat less pronounced trend, with the combined anger and sadness emotions declining from 64.8% to 49.7% and then moderating to 69.2%. What is particularly noteworthy is that highly influential tweets, both top retweeted and top favorited, display more significant fluctuations in these emotions over the three years. This suggests that these influential tweets are more strongly associated with pessimistic sentiments than the ordinary tweets, reflecting the nuanced emotional dynamics in this online discourse.

Table 4 The emotion distribution of tweets with top retweets and favorites (n = 1000).

Topics modeling

Topic modeling offers insights into the evolving themes within public discussions. Selecting an optimal number of topics is not standardized, but for our analysis, we have chosen 15 topics with 25 n-terms in each topic. This choice was made as it provides the best interpretability for identifying core concepts within the discourse. We operate under the assumption that public discourse comprises interconnected concepts (Sutherland, 2005). Our goal is to depict the connections among the extracted topics. The output distribution of topics is showcased in Fig. 2, including clusters that contain meanings that are challenging to identify, labeled in green. We employ hierarchical clustering, a matrix-based cosine distance method embedded in Bertopic, to generate an inter-topic distance map while retaining essential words within topic descriptions. By clustering documents into semantically similar groups and subsequently creating topical presentations, we aim to quantitatively reduce the confusion stemming from scattered topics, thus enhancing interpretability. For example, in the hierarchical clustering_2020 illustrated in Fig. 2c, we observe that T0, T1, T5, T7, T9, and T13 exhibit closer topic distances, while they are relatively distant from the other topics.

Fig. 2: Topic hierarchical clustering across years.
figure 2

ac show the evolving topic clusters from 2020 to 2022. Topics in (a) are dispersed and relatively diverse. b shows highly clustered topics organized into two distinct communities. c illustrates a multipolar distribution of topics, characterized by high diversity.

To clarify the similarity scores between subjects, Fig. 3 presents them in a matrix. This figure indicates that the differentiation between topics has shifted over time, with 2020 featuring more divisive topics alongside some very similar ones. It also reveals that topic similarity increased during 2021. By 2022, the topic similarity distribution appears to be more uniform and exhibits a higher average similarity value, implying reduced topical diversity.

Fig. 3: Topic similarity across years.
figure 3

ac illustrate the evolution of topic similarities from 2020 to 2022. Topics in (a) are centralized and highly similar. b displays decentralized topics forming distinct communities. c shows topics that are fragmented and less distinct.

Evolving discourse frame

In this section, we introduce a qualitative theoretical analysis technique called the “Evolving Discourse Framework Analysis,” designed to unravel the complexities of online discussions throughout the pandemic. It is assumed that online discourses can be structured under the frame theory of communication, to gain insights into the underlying structures that shape public opinion, influence perceptions, and ideological sphere in the digital landscape. Notably, discussions on online platforms may also reflect hierarchies and power dynamics of human world, which refers to the scope of political sociology and deserve other specialized discussions.

Action, feeling, identity trends fluctuation subordinate to nationalism

Researchers have posited that nationalist discourse can be categorized into two primary components. The first category revolves around nationalistic feelings, encompassing elements such as consciousness, attitudes, expectations, and loyalty. The second category encompasses doctrines, including ideologies, programs, organizational activities, and movements (Smith and Hall, 2004). The first category highlights the expression of sentiments during the process, while the second category serves as a reference for action. Furthermore, nationalism revolves around the concept of affiliated identity. Referring to the nationalism discourse framework related to the Hong Kong protests (Ho, 2022), we can consider three critical aspects: threat, identity, and action.

By merging traditional nationalism theory with the insights from the aforementioned studies, we construct a discourse framework that incorporates feeling, action, and identity. To facilitate visualization, we illustrate their evolution in Fig. 4. To gain a better understanding of how these three frameworks perform with imbalanced data, we include “gram.” Gram for each category is calculated and normalized within the annual counts.

Fig. 4
figure 4

Nationalism discourse in topic cast (2020–2022).

As Fig. 4 depicts, the topics within the nationalism framework have evolved over the years. The “feeling” category exhibited a notable increase from 2020 to 2022, while the “identity” category appears to have weakened. The “action” category remains relatively consistent over the three years, with a slight uptick in 2022.

The shift in the identity curve can be attributed to the initially hostile attitudes directed towards nations and their citizens affected by the pandemic in its early stages. As the epidemic transformed into a global issue, the notion of identity linked to the disease became less distinct. Consequently, by 2022, the topic of identity had already diminished in significance, occupying a smaller proportion of the discourse.

The surge in the “feeling” category reflects the collective mindset of global citizens to some degree. Within the realm of feeling, themes of intimidation and insecurity emerge prominently through word clusters.

Frequency words under action, feeling, and identity framework

Conducting word frequency analysis offers a rapid means to discern the prevalent themes in public discourse and gauge the prevailing attitudes and emotions (Ho, 2022). In this study, we harnessed high-frequency word analysis over the course of three years and compared these findings with the high-weighted words in the topic clusters to trace the evolution of discourse.

Table 5 presents the top 20 high-frequency words in two distinct categories. To enhance thematic clarity, we have excluded stop words and search keywords such as “covid/COVID,” “China/china,” and “coronavirus/virus.” The results shed light on the evolving discourse patterns.

Table 5 Comparison of top freq. words in tweets and in topic clusters (n = 20).

In 2020, the public discourse predominantly emphasized nations (e.g., “us,” “Chinese,” “American,” “Europe,” “Japan,” “Italy”), which symbolized a strong emphasis on identity. This period was characterized by a pervasive sense of anxiety and insecurity. In 2021, the term “vaccine” was repeatedly mentioned, reflecting the public collective struggle to find a solution through action. High-frequency words like “influence” and “protest” also underscored the theme of action. By 2022, urban-level lockdowns had garnered global attention, with public discourse increasingly focusing on the “zero-covid” strategy. Emotional language became more poignant during this time.

These findings validate the upward trajectory of the “feeling” framework in terms of vocabulary composition. In contrast, the “identity” framework waned significantly in 2022, while the “action” framework remained prominent throughout the three years. These insights offer a glimpse into the discursive evolution of nationalism, as reflected in the language used in public discourse, highlighting the need for further analysis to explore the interconnections between these evolving words and concepts.

Diachronic comparison of co-occurrent network under three frames

The preceding analysis has provided insights into core topics and clusters of high-frequency words. In the realm of discourse analysis, the network visualization of co-occurring high-frequency words aids in comprehending how public discourse is structured. In Fig. 5, high-frequency words are extracted by initially segmenting tweet text into words, then eliminating stop words, and ultimately organizing them by frequency. These high-frequency words are treated as nodes in the network, with the co-occurrence of words within a common sentence forming edges. The entire network is visualized using Gephi software.

Fig. 5: Co-occurrence of high-frequency words.
figure 5

ac show the change of words connection from 2020 to 2022. In (a), notable words include those related to nation-state identity and pandemic-related hostility. b emphasizes words associated with actions. c shows an increase in words related to global pandemic prevention.

Statistical examination of the lexical association network revealed that the graph density of the keyword network increased over time (D_2020 = 0.866, D_2021 = 0.767, D_2022 = 0.919), while the degree of modularity decreased (M_2020 = 0.475, M_2021 = 0.280, M_2022 = 0.146). This implies that key concepts discussed in public conversations became more diverse and fragmented in 2022.

Analyzing the connections between words unveils shifts in public attention across the years. Two word-clusters stood out in 2020: one centered on nation-state identity, and the other on pandemic-related hostility and calls for punishment. In 2021, discourse reflected the consequences of actions, with topics such as protests taking prominence. By 2022, specific cities gained attention, signaling that the pandemic had evolved into a global phenomenon, and the dominant role of individual nations had begun to wane. This shift in the discourse network outlined a changing discourse landscape amid the pandemic.

Concrete emotional combinations under the three frames across years

The analysis conducted at the lexical level uncovers a potential discourse framework of nationalism comprising national emotions, actions, and identity. To ensure a comprehensive understanding without distortion, complete context is essential.

By combining the nationalism framework with thematic analysis and text examination, we identify that the framework of national emotions encompasses elements like perceived threats, expectations, loyalty, as well as human emotions such as fear, worries, sympathy, and anger. The action framework encompasses programs, movements, strategies, and concerns, while the identity framework incorporates nationality, race, location, alignment, and ideological affiliations. Following these principles, we select the top 50 high retweets for each year, totaling 150 high retweets. These were manually encoded by three trained coders using separate coding, comparison, and correction. For texts that fall into multiple categories, their significant concerns were evaluated.

The distribution of emotions within this discourse framework over the course of three years is visualized in Fig. 6. This visualization highlights that sadness remains the prevailing emotion across all time periods. In terms of emotional distribution, anger is more prominent in 2020 (n = 23), significantly decreases in 2021 (n = 5), but resurges in 2022 (n = 14). On the other hand, optimism experiences an increase starting from 2021, although the overall counts remain relatively low.

Fig. 6
figure 6

Evolution of discourse framework with emotions.

Drawing insights from the distribution of highly retweeted topics within the framework, notable trends emerge. Within the action framework, the discourse focus shifted: in 2020, there was heightened attention on programs (n = 17), followed by a broader range of concerns in 2021 (n = 5), and a simultaneous increase in discussions about both programs (n = 13) and strategies (n = 9) in 2022.

Turning to the feeling framework, a significant upsurge in sadness is evident in 2021 (n = 13), indicating a collective emotional response. Notably, sympathy experiences an increase in 2022 (n = 5), implying a growing sense of empathy, potentially signifying global recovery efforts in the making.

Within the identity framework, nationality emerges as a pivotal discourse point. Regrettably, discussions centered around nationality often carry undertones of anger. A more focused examination of these specific discourses unveils that nationality is frequently targeted in critical tweets, particularly during the initial stages of the epidemic. Furthermore, identity is framed not only as a marker of nationality but also as an indicator of alignment, including ideological affiliations.

This segment of analysis delves into the perspective of framework-emotion within the top 50 highly retweeted tweets. These widely shared tweets hold significance in terms of information dissemination within the realm of social networks.

Discussion and conclusion

This paper has developed a comprehensive study based on around 2.65 million tweets during the corona virus pandemic. A quantitative analysis, including sentiment & emotion analysis, Bert-based topic modeling techniques, and network analysis, provides rich evidence to observe the online public opinion. A qualitative analysis is developed to dissect the discourse framework of nationalism and observe the evolving trends of the discourses over the three years of the pandemic. The major findings are four aspects:

(1) Topic modeling results reveal the changes in the action, emotion and identity elements in the composition of nationalist discourse: the identity frame is declining, the emotion frame is rising, and the action frame basically keeps consistent.

(2) High-frequency keywords reflect diachronic changes in public discourse under three frames: Two phrases stood out in 2020: one centered on nation-state identity, the other centered on pandemic-related hostility and calls for punishment. In 2021, rhetoric reflects the consequences of actions, and topics such as protests become prominent. By 2022, specific cities have received attention, indicating that epidemic control measures are differentiated, with actions in some regions receiving more attention.

(3) The keyword network reflects the connections and changes between key concepts, and the overall graph density increases over time. Compared with 2020 and 2021, the key concepts discussed in public dialog in 2022 have become more diverse and fragmented.

(4) The sentiment analysis results reflect the changes in sentiment under the three frameworks. The affective framework contains elements such as perceived threat, expectation, and loyalty, as well as human emotions such as fear, worry, sympathy, and anger. Action frames include concerns such as plans, campaigns, strategies, etc., while identity frames include nationality, ethnicity, location, alliances, and ideology.

Our findings illuminate an evolving public discourse framework over time, which can be elucidated through the lenses of the identity framework, action framework, and feeling framework. These discoveries offer a nuanced understanding of how public discourse morphs during large-scale global events and may shed light on the emerging norms brought forth by epidemics. The heightened focus on national identity, responsibility, and solidarity during the pandemic offers a unique context for understanding how online discussions are shaped by and contribute to the broader discourse on nationalism. Furthermore, the pandemic has accentuated the role of online platforms as crucial sphere for the construction and negotiation of national narratives.

This study employs a mixed approach, amalgamating text mining with qualitative analysis. By merging theoretical frameworks from the social sciences with computational techniques, we provide a scientific elucidation for tracking the evolution of public sentiment. We systematically categorize and quantify individual expressions, encompassing sentiment, emotions, topics, and salient expressions within tweets. In contrast to the traditional focus on the relationships between variables, our approach is inspired by framing theory from social science research, as we synthesize key elements from the theoretical framework of nationalism for a comparative analysis of tweets spanning the epidemic course. Our findings not only enrich the analysis of social opinions related to the past pandemic over different time periods but also demonstrate the practical application of this hybrid method, combining computational and qualitative text mining.

As an exploratory study, we did not impose a predefined framework for public discourse but aimed to discern common norms from empirical data. Nonetheless, a theoretical foundation remains crucial. Our analysis centers on the evolution of public discourse, necessitating controlled datasets across time. Our work has the potential to unveil some of the elements that divide our world while also exposing emotions that bind it together.

Throughout this study, we have consistently observed elements of hatred and sadness on Twitter, coexisting with negative sentiments during the epidemic. By merging computational and quantitative analyses, we endeavor to decode the unstructured content generated on social media, encompassing sentiments, emotions, potential topics, and specific utterances. Viewing this through the lens of the nationalism discourse framework, our research offers a diachronic reference for the study of public opinion during the epidemic.

However, there are two principal limitations in this study. First, we employ unsupervised machine learning to assess sentiment and emotions. While this method efficiently processes data, errors are inevitable, especially when dealing with unstructured data, making corrections challenging. Second, the mixed approach we introduce in this study relies on pre-established social theories. This might pose interdisciplinary challenges when applied to different datasets.

Correspondingly, there are two promising directions for future research. Firstly, the integration of advanced natural language methods such as self-supervised learning and Artificial Intelligence Generated Content (AIGC) models. Self-supervised learning involves designing auxiliary tasks to uncover inherent features of textual data as supervised signals, thereby enhancing the model ability to extract meaningful information. ChatGPT and other AIGC models, trained on large-scale data samples, can effectively manage error rates through annotated training data and prompt engineering. Secondly, it is crucial to explore a comprehensive mixed-methods framework suitable for social sciences, which combines innovative computational approaches with compatibility with social science theories. This necessitates research teams to possess interdisciplinary foundations for their investigations.