Reading the climate room through unsupervised analysis of unfiltered climate perspectives

Sweeney, Lorin; Mehrotra, Rabhya; Saintraint, Fionna; Brennan, Robert A.; Suiter, Jane

doi:10.1038/s41598-026-44553-x


Article
Open access
Published: 24 March 2026

Scientific Reports volume 16, Article number: 14828 (2026)

Abstract

Understanding how competing perspectives shape public perception of climate change is critical, yet existing studies are limited by predefined categories, manual coding, or methodologically constrained topic models. We address this gap by introducing the first large-scale, publicly available dataset of climate discourse, consisting of systematically curated advocate (21,513 documents) and skeptic (26,930 documents) corpora. To facilitate rigorous comparative perspective analysis, we propose an innovative semantic chunking algorithm that segments documents into coherent, semantically meaningful units. Our unsupervised, inductive, exploratory approach reveals distinctive rhetorical strategies: advocates emphasize pragmatic solutions and crisis framings through fear and sadness appeals, while skeptics deploy anti-elite rhetoric and emotional appeals rooted in anger and disgust. These openly accessible resources provide a scalable methodological foundation enabling researchers across psychology, political science, and communication studies to systematically investigate climate discourse dynamics and other contentious societal perspectives.

Introduction

Climate perspectives strongly influence public perceptions and political engagement around environmental challenges¹. These perspectives integrate events into societal worldviews, shape attitudes towards policy, and drive or inhibit collective actions. Given their centrality, accurately identifying and analyzing climate perspectives is critical. However, despite the importance of rhetorical analyses in climate discourse^2,3, current methods face limitations that hinder systematic comparative research at scale.

Traditionally, climate rhetoric studies have employed deductive coding methods, using predefined frameworks shaped by expert judgment^4,5. While insightful, these approaches inherently restrict analysis to anticipated categories, potentially missing emergent patterns or unexpected rhetorical strategies⁶. Moreover, deductive coding remains labor-intensive, reducing reproducibility and scalability, particularly for large or evolving corpora.

Inductive computational methods have emerged as alternatives aiming to overcome these limitations. Probabilistic topic models represent historically prominent inductive methods⁷. Yet, despite their nominally unsupervised classification, probabilistic methods require significant manual intervention: researchers must pre-select topic counts, iteratively refine model parameters, and subjectively interpret resulting topics, thus introducing biases and substantially limiting reproducibility and objectivity^8,9. These interpretative and parameter-tuning demands undermine claims of genuinely unsupervised or scalable analysis¹⁰. A detailed methodological critique comparing probabilistic and embedding-based approaches is provided in Supplementary Information S1.

More recently, embedding-based methods (e.g., BERTopic, Top2Vec) that employ transformer-based language models have emerged as a promising alternative. These methods provide contextual embeddings reflecting nuanced semantic relationships, and clusters emerge organically without predefined topic counts or extensive manual interpretation, significantly enhancing reproducibility and interpretability^10,11. Despite their advantages, such methods remain underutilized in climate communication research.

Recognizing these methodological gaps and the absence of openly accessible, systematically curated datasets, we developed two resources explicitly designed to facilitate rigorous, scalable, and inductive comparative analyses:

1. A systematically curated, publicly available dataset of climate perspectives: This dataset includes 21,513 advocate articles identified from recognized climate advocacy platforms and 26,930 skeptic articles adapted from a validated source list⁵. Articles were systematically scraped and filtered for length, linguistic quality, and redundancy, with noise and repetitive boilerplate content iteratively minimized through embedding-based filtering (complete corpus selection and cleaning details are documented in S3).

2. An embedding-based semantic segmentation algorithm: Conventional text segmentation methods (e.g., fixed-length or sentence-based segmentation) often fail to capture the fluidity and coherence of rhetorical perspectives. To overcome this, our adaptive method employs transformer-based embeddings, dynamically grouping semantically cohesive sentences through cosine similarity thresholds and gradient-based optimization. Bayesian hyperparameter optimization ensures consistent, reproducible segment quality (algorithm details, pseudocode, and optimization procedures are fully described in Methods and S3).
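The grouping step of the segmentation idea can be illustrated with a minimal sketch. It assumes sentence embeddings are already available as plain lists of floats, and it starts a new chunk whenever the next sentence falls below a cosine similarity threshold relative to the running mean of the current chunk. The function names and the threshold value of 0.7 are hypothetical, and the gradient-based and Bayesian optimization steps described in Methods and S3 are not reproduced here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def chunk_sentences(embeddings, threshold=0.7):
    """Group consecutive sentences into chunks of sentence indices.

    A new chunk starts when the next sentence's similarity to the running
    mean of the current chunk drops below `threshold` (a hypothetical
    value, not the tuned setting used in the study).
    """
    chunks = []
    current, centroid = [], None
    for i, emb in enumerate(embeddings):
        if current and cosine(centroid, emb) < threshold:
            chunks.append(current)
            current, centroid = [], None
        current.append(i)
        n = len(current)
        # Update the running mean of the embeddings in the current chunk.
        centroid = list(emb) if centroid is None else [
            (c * (n - 1) + e) / n for c, e in zip(centroid, emb)
        ]
    if current:
        chunks.append(current)
    return chunks

# Toy 2-D "embeddings": two sentences on one theme, then two on another.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(chunk_sentences(toy))  # [[0, 1], [2, 3]]
```

Comparing each sentence with the chunk mean rather than only with its predecessor is one possible design choice; it makes the boundary decision less sensitive to a single off-topic sentence.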

Applying these resources in illustrative, exploratory analyses revealed distinctive rhetorical strategies between advocates and skeptics. Advocate perspectives predominantly emphasize pragmatic policy solutions to climate challenges, frequently utilizing emotional appeals based on fear and sadness to underscore urgency and motivate collective action. Skeptics, conversely, heavily deploy anti-elite populist rhetoric, portraying climate actions as elite-imposed injustices, consistently using anger and disgust to discredit mainstream climate science and policies. Such findings exemplify the analytical utility and clarity afforded by our approach, capturing nuanced rhetorical distinctions systematically and reproducibly.

Importantly, these resources offer methodological utility far beyond climate discourse. Researchers investigating vaccine hesitancy, political polarization, or misinformation can directly apply the openly available dataset and segmentation method, facilitating broad interdisciplinary insights into rhetorical structures, framing, and societal influences at previously unattainable scales.

By addressing fundamental methodological limitations and providing transparent, reusable resources, we enable systematic comparative research, informing effective, evidence-based communication and policy interventions in polarized contexts.

Results and analysis

The primary aim of the current work is to demonstrate a novel computational methodology designed to provide unsupervised perspective analysis and feature extraction for further investigation. A wide range of linguistic features in climate communication could be examined using this approach. As illustrative examples, we performed exploratory analyses that compared several prominent linguistic aspects of advocate and skeptic climate perspectives. Informed by relevant climate communication literature, these linguistic aspects were populist rhetorical features, emotive appeals, problem vs. solution orientations, frames, and expressed emotions (distinct from emotive appeals). The analysis proceeds in two complementary stages: a group-level overview, followed by a fine-grained topic-level comparison. Topics were derived using our clustering and topic modeling pipeline (see section 4.9). We identified the 24 most similar topics across skeptic and advocate sources by calculating a cosine similarity matrix for each topic pair combination. Focusing on these high-similarity pairs provides a targeted basis for comparative analysis (for the full list, see Table S10). Extended theoretical context, methodological explanations, and additional statistical details (Tables S3–S14) are provided in Supplementary Information S1 & S2.
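The topic-matching step can be sketched as follows, assuming each topic is represented by a single embedding vector. The function name, the toy labels, and the vectors are hypothetical. The sketch simply ranks all advocate-skeptic pairs by cosine similarity and does not enforce one-to-one matching, which the text does not specify.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_matching_pairs(advocate_topics, skeptic_topics, k=24):
    """Return the k most similar (similarity, advocate, skeptic) triples.

    Each argument maps a topic label to its embedding vector.
    """
    scored = [
        (cosine(a_vec, s_vec), a_label, s_label)
        for a_label, a_vec in advocate_topics.items()
        for s_label, s_vec in skeptic_topics.items()
    ]
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:k]

# Toy example with made-up labels and 2-D vectors.
advocate = {"wildfires": [1.0, 0.0], "electric vehicles": [0.0, 1.0]}
skeptic = {"fires": [0.9, 0.1], "cars": [0.2, 0.8]}
for sim, a, s in top_matching_pairs(advocate, skeptic, k=2):
    print(f"{a} <-> {s}: {sim:.3f}")
```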

Group-level analysis

Aggregated results across the advocate (125,407 chunks) and skeptic (80,817 chunks) corpora of matched topics revealed distinct rhetorical and emotive patterns (Fig. 1).

Populist rhetorical features. Skeptics used anti-elite rhetoric more frequently (20.2%) than advocates (8.6%; $Z=85$, $h=0.35$, $p<0.001$). Conversely, crisis-framing was more prevalent in advocate texts (17.6%) than skeptic texts (11.4%; $Z=-29$, $h=-0.21$, $p<0.001$). People-centrism appeared slightly more frequently in advocate perspectives (8.7%) compared to skeptics (6.4%; $Z=-15$, $h=-0.09$, $p<0.001$), while simplistic solutions were marginally more common in skeptic discourse (3.5%) compared to advocates (2.0%; $Z=8$, $h=0.1$, $p<0.001$).
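For readers less familiar with the statistics reported in this section, the sketch below shows the standard pooled two-proportion z statistic and Cohen's h. It is illustrative only: the exact test variant and the denominators used for each feature are not stated in this section, so the values it produces need not match the reported figures. The function names are hypothetical.

```python
import math

def two_proportion_z(p1, n1, p2, n2):
    """Pooled two-proportion z statistic for the difference p1 - p2."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

def cohens_h(p1, p2):
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

# Anti-elite rhetoric, with skeptics as the first group and the chunk
# counts from the group-level overview assumed as denominators.
z = two_proportion_z(0.202, 80_817, 0.086, 125_407)
h = cohens_h(0.202, 0.086)
print(f"z = {z:.1f}, h = {h:.2f}")
```

With skeptics entered as the first group, positive values indicate higher prevalence among skeptics and negative values indicate higher prevalence among advocates.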

Emotive appeals. Advocate content exhibited significantly greater appeals to fear (12.6%) than skeptic content (9.4%; $Z=-14.6$, $h=-0.12$, $p<0.001$) and sadness (advocates: 5.5%; skeptics: 3.3%; $Z=-17.8$, $h=-0.10$, $p<0.001$). Skeptics appealed significantly more to anger (10.6% vs. advocates: 7.1%; $Z=24.7$, $h=0.16$, $p<0.001$) and disgust (skeptics: 3.2%; advocates: 1.9%; $Z=8.8$, $h=0.09$, $p<0.001$). Differences for surprise and happiness were smaller yet statistically significant (surprise: skeptics 0.63%, advocates 0.32%; $Z=8.4$, $h=0.04$, $p<0.001$).

Narrative framing styles. Skeptics more frequently employed scientific/technical framing (71.1%) compared to advocates (65.9%; $Z=21.5$, $h=0.11$, $p<0.001$). Advocates favored pragmatic/economical framing substantially more (22.0%) than skeptics (12.3%; $Z=-49.1$, $h=-0.23$, $p<0.001$). Moral/ethical framing was slightly more common in advocate discourse (6.5% vs. skeptics: 5.6%; $Z=-8.2$, $h=-0.04$, $p<0.001$), whereas skeptics relied more frequently on ideological/emotive framing (5.2% vs. advocates: 2.4%; $Z=38.5$, $h=0.18$, $p<0.001$).

Problem-solution orientations. Advocate discourse emphasized explicit solutions considerably more (15.6%) than skeptic discourse (6.9%; $Z=50.9$, $h=0.2$, $p<0.001$). Conversely, skeptics consistently emphasized problems or barriers more frequently across all texts (skeptics: 32.7%; advocates: 28.1%; $Z=-34.2$, $h=-0.10$, $p<0.001$).

Expressed emotions. Beyond strategic emotional appeals, analysis of expressed emotions also revealed significant differences ($\chi ^2=79.21$, $p<2.2\times 10^{-8}$, $df=22$). Advocate texts explicitly expressed higher proportions of positive emotions such as approval and optimism, alongside emotions indicating concern or vulnerability (e.g., nervousness), while skeptic texts expressed disapproval and criticism significantly more frequently (see Table S9 for full details).

Topic-level analysis

To explore nuanced contextual differences beyond aggregate trends, we conducted a detailed comparative analysis across the 24 most closely matched topic pairs between advocate and skeptic corpora for the same groups of features.

Populist rhetorical features. Significant variation emerged in the use of populist rhetoric by topic (Fig. 2). Anti-elite rhetoric was notably emphasized by skeptics within ecological topics, particularly in Pollinator Health & Biodiversity Loss (skeptic proportion 29.8%, advocate proportion 7.2%; $Z=14.2$, $h=0.79$, $p<0.001$), as well as in marine ecosystem contexts such as Ocean Acidification & Coral Reef Health (skeptic 26.1%, advocate 5.3%; $Z=11.7$, $h=0.72$, $p<0.001$). Crisis-framing was substantially more prevalent in advocate perspectives addressing acute climate-related threats, exemplified by Australian Bushfires & Climate Extremes (advocate 41.2%, skeptic 8.7%; $Z=-13.4$, $h=-0.77$, $p<0.001$) and Extreme Weather & Climate Refugees (advocate 33.4%, skeptic 9.1%; $Z=-10.8$, $h=-0.63$, $p<0.001$). People-centrism showed smaller yet statistically robust differences, notably higher in advocate content regarding Wildfire Smoke & Public Health (advocate 15.2%, skeptic 5.9%; $Z=-6.1$, $h=-0.32$, $p<0.001$). Skeptics demonstrated slightly greater reliance on simplistic solutions, particularly within technologically focused topics, including Geoengineering & Negative Emissions (skeptic 11.4%, advocate 3.3%; $Z=4.6$, $h=0.22$, $p<0.001$).

Emotive appeals. Emotive appeals also varied significantly by topic (Fig. 3). Advocate texts notably emphasized fear in contexts highlighting severe or immediate risks, especially in Extreme Weather & Climate Refugees (advocate 22.1%, skeptic 4.6%; $Z=-10.1$, $h=-0.51$, $p<0.001$), and sadness prominently in Australian Bushfires & Climate Extremes (advocate 11.7%, skeptic 3.7%; $Z=-5.8$, $h=-0.29$, $p<0.001$). Skeptic content demonstrated a significant reliance on anger and disgust appeals in policy-contested or scientifically controversial topics. This pattern was strongest in Ocean Acidification & Coral Reef Health (anger: skeptic 17.1%, advocate 3.6%; $Z=8.8$, $h=0.63$, $p<0.001$; disgust: skeptic 6.3%, advocate 2.5%; $Z=3.5$, $h=0.25$, $p<0.001$) and also notable in topics such as Wildfires: Impacts, Dynamics & Climate Drivers (anger: skeptic 14.5%, advocate 6.0%; $Z=7.3$, $h=0.42$, $p<0.001$). Smaller but statistically significant differences were observed for surprise (e.g., Pollinator Health & Biodiversity Loss, skeptic 1.2%, advocate 0.3%; $Z=2.6$, $h=0.19$, $p<0.01$).

Narrative framing styles. Topic-specific variations in framing styles were similarly robust (Fig. 4). Pragmatic/economical framing was significantly more common among advocates within technology and policy-related contexts such as Electric Vehicles & Hybrid Transport (advocate 38.4%, skeptic 8.4%; $Z=-9.4$, $h=-0.68$, $p<0.001$). Skeptics favored scientific/technical framing across multiple contexts, notably in debates over scientific uncertainty, exemplified by Climate Debate: Denialism & Consensus (skeptic 78.2%, advocate 70.1%; $Z=4.0$, $h=0.19$, $p<0.001$). Moral framing was strongly favored by advocates in explicitly ethical contexts such as Pope Francis & Vatican Climate (advocate 42.6%, skeptic 3.7%; $Z=-12.6$, $h=-0.93$, $p<0.001$). Ideological/emotive framing showed consistent skeptic preference, strongest in topics linked to cultural identities like Pollinator Health & Species Extinction (skeptic 8.2%, advocate 3.1%; $Z=3.6$, $h=0.31$, $p<0.001$).

Problem-solution orientations. Finally, systematic differences emerged in problem-solution orientations by topic (Fig. 5). Skeptics more consistently emphasized problems or barriers associated with proposed solutions, particularly within renewable technology topics, exemplified by Electric Vehicles, Hybrid Transport & Battery Innovation (skeptic 44.1%, advocate 20.3%; $Z=7.6$, $h=0.55$, $p<0.001$). In contrast, advocates clearly prioritized explicit solutions, notably in contested policy topics such as Geoengineering & Negative Emissions (advocate 29.8%, skeptic 8.1%; $Z=-7.1$, $h=-0.57$, $p<0.001$), underscoring systematic rhetorical differences between these strategies.

Expressed emotions. Topic-level analysis of explicitly expressed emotions provided additional clarity to the aggregate emotional patterns identified previously (Fig. 6). Statistically significant differences emerged for two key topics: Paris Agreement & Global Climate Governance ($\chi ^2=29.93$, FDR-corrected $p=0.0054$), and US Climate & Energy Policy: Clean Energy & Politics ($\chi ^2=27.87$, FDR-corrected $p=0.0237$).
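The text reports FDR-corrected p-values without naming the procedure. The sketch below assumes the widely used Benjamini-Hochberg adjustment; this is an assumption rather than a confirmed detail of the analysis, and the p-values in the example are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Made-up raw p-values for four hypothetical topic-level tests.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.20]))
```

Applied to one raw p-value per topic pair, an adjustment of this kind controls the expected share of false discoveries across the 24 comparisons.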

Discussion

Our results reveal clear rhetorical and emotional distinctions between advocate and skeptic climate perspectives, resonating strongly with existing literature on populist rhetoric, narrative framing, and emotional appeals. Skeptics consistently deploy anti-elite perspectives at a significantly higher rate than advocates, particularly within ecological and policy contexts, reinforcing the findings of¹², who emphasize how skeptic rhetoric frequently casts established authorities—whether scientific bodies or governmental institutions—as untrustworthy or disconnected from the public’s concerns. This aligns with broader trends identified by¹³ and¹⁴, who note rising distrust in institutions and science following the COVID-19 pandemic. Conversely, advocate perspectives emphasize crisis-framing more frequently, especially in contexts of acute threats like natural disasters, underscoring urgency to mobilize immediate policy action¹⁵. These contrasting populist features resonate with¹⁶’s observation that skeptics often dismiss advocate crisis claims as exaggerated, labeling climate advocates as alarmists.

Contrasts also emerge in the emotional appeals employed by advocates and skeptics. Advocate perspectives predominantly leverage fear and sadness, strategically heightening perceptions of threat or fostering empathic concern to mobilize collective action, consistent with prior literature on emotional mobilization^17,18,19. Skeptics, by contrast, rely heavily on anger and disgust, emotions documented to foster in-group cohesion, resistance to external perspectives, and intensified moral boundaries^20,21,22. Such moralizing outrage may serve not merely to discredit opposing views but also to reinforce ideological opposition to mainstream climate policy.

Importantly, our results also distinguish between strategic emotional appeals and emotions explicitly expressed by content creators. Skeptics consistently express overt disapproval, reinforcing their oppositional stance, whereas advocates more commonly express approval and nervousness, perhaps reflecting an orientation towards positive advocacy coupled with genuine concern over climate threats. As indicated by¹⁷, explicit emotional expressions can induce distinct psychological effects compared to strategic emotional appeals, influencing audience appraisals of risk, personal agency, and preferences for policy actions.

Both groups extensively employ scientific or technical arguments, yet skeptics invoke such framing more frequently. As suggested by²³, this greater reliance on technical rhetoric by skeptics often seeks to legitimize alternative, minority viewpoints, setting up a “duelling scientists” scenario. Skeptics strategically use scientific framing to cultivate an image of credible opposition, reflecting what²⁴ term “environmental skepticism,” a tactic designed to sow doubt in scientific consensus and, ultimately, climate policy. Advocates, meanwhile, predominantly employ pragmatic or economical framing, emphasizing actionable policy solutions and economic feasibility–consistent with an action-oriented approach to climate advocacy described by²⁵. Such pragmatic discourse not only communicates the practicality of climate solutions but also positions climate action within an economically beneficial framework.

Our analysis further underscores substantial differences in problem versus solution orientation. Skeptics consistently emphasize potential obstacles and negative externalities associated with climate interventions, particularly within technologically oriented or policy-intensive topics such as renewable energy and electric vehicles. This aligns with findings by⁵, highlighting that contemporary skeptic discourse tends to concentrate on secondary or tertiary obstruction strategies, shifting the focus from denying climate change outright to emphasizing doubts about proposed solutions’ viability or fairness²⁶. Conversely, advocate texts systematically emphasize explicit solutions, reflecting a strategic choice to galvanize collective action through tangible proposals, as found by²⁷.

This study offers three key methodological contributions. First, it introduces what appears to be the first large-scale comparative dataset of advocate and skeptic climate content. Second, it presents a novel segmentation algorithm based on text embeddings, specifically designed to preserve critical rhetorical nuances often lost through standard paragraph splitting or broad document summarization. Third, it demonstrates a fully unsupervised analysis pipeline capable of capturing subtle rhetorical and emotional distinctions that conventional supervised approaches tend to obscure. By removing reliance on researcher-defined coding schemes, this approach supports more flexible data-driven insights.

Our findings also open several promising avenues for future research. Researchers could further leverage chunk-level units by clustering them independently of document structure, isolating specific debates–such as controversies surrounding wind turbine recycling–to explore systematically whether skeptics consistently portray these issues as insurmountable barriers while advocates frame them as solvable challenges. Additionally, enriching chunk-level insights with supplementary features such as finer-grained populist framing and emotional appeals could reveal whether particular rhetorical strategies co-occur systematically at the sub-topic level, illuminating how arguments and emotions diffuse within broader advocate or skeptic discourses.

Experimental and survey-based studies may clarify how combinations of emotional appeals and problem-solution framing shape public opinion, particularly examining whether anger-based skeptic messaging elicits stronger policy resistance compared to fear-oriented advocate messaging. Identifying which counter-framing strategies most effectively mitigate the polarizing impacts of anger and disgust-based appeals represents a particularly rich direction. Finally, targeted investigations into how mainstream media adapt and propagate these competing perspectives would elucidate the pathways through which niche advocate and skeptic arguments enter wider public discourse, enhancing our understanding of perspective diffusion and influence.

Taken together, these directions lay the foundation for transitioning from exploratory discovery toward more hypothesis-driven research, offering a richer understanding of how competing climate perspectives interact, evolve, and ultimately influence public debate and policy outcomes.

Methods

Data collection and preprocessing

The definition of discourse structures is contested across climate communications literature; as a result, there is “little consensus” on what makes a coherent textual unit. We aimed to account for different levels of abstraction, often overlooked in discourse analysis. For example, a textual unit could be about polar ice caps generally, or about a specific region of polar caps. We therefore distinguished between macro- and micro-level perspectives, drawn from the document level and the chunk level, respectively.

To construct a robust dataset for analyzing climate change perspectives, we curated two distinct corpora (an advocate corpus and a skeptic corpus), each representing opposing argumentative frames within climate discourse. Standard NLP preprocessing techniques, such as lemmatization, stop-word removal, and lowercasing, are commonly applied to reduce textual variability and improve computational efficiency in tasks like document classification and topic modeling. These approaches are particularly useful in methods that rely solely on word frequency distributions, where surface-level differences in wording can introduce noise rather than meaningful distinctions. However, in the context of discourse analysis, such standardization can be detrimental. By stripping away stylistic and rhetorical markers (including emphasis, modality, and lexical choice), these techniques risk erasing the very features that distinguish competing perspectives within a shared topical domain. Given that all documents in our dataset inherently pertained to climate change, the critical challenge was not identifying the broad subject matter but rather capturing the divergent ways in which it is framed. To preserve these discursive features, we employed state-of-the-art sentence and document embedding models, which generate rich semantic representations while mitigating the impact of non-informative noise. Our pre-processing pipeline was designed to remove extraneous content, such as boilerplate text, without distorting the linguistic structures essential for clustering and discourse analysis, ensuring that the argumentative and stylistic dimensions of climate discourse remain intact.

Corpus selection and construction

Advocate Corpus: In consultation with climate communication experts, we identified 55 widely recognized advocate blogs as potential sources. Selection was based on known prominence within climate discourse rather than explicit inclusion criteria. However, scrapability was a key constraint: sources that could not be reliably crawled due to restrictive robots.txt policies or that lacked structured textual content (e.g., video-based sites) were excluded. After filtering for accessibility, the final corpus consisted of 26 sources. The dataset comprises 21,513 articles, written in English but sourced from a variety of international outlets. To ensure consistency and mitigate extreme length variations, articles were filtered to a minimum of 250 characters and a maximum of 25,000 characters. The full list of advocate sources is provided in Supplementary Information A.

Skeptic Corpus: To ensure comparability, we adopted Coan et al.’s (2021) list of skeptic sources. The original list contained 34 sources; however, after applying the same selection and filtering criteria as for the advocate corpus, we refined this to 17 sources, collectively yielding 26,923 articles. These articles, sourced from various international domains, were subject to the same length constraints as the advocate corpus. The full list of skeptic sources is documented in Supplementary Information A.

Both corpora were filtered to retain only English-language content, and duplicate articles were removed during the scraping stage to ensure dataset integrity.
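A minimal sketch of the length and duplicate filters is given below. The 250 and 25,000 character bounds come from the text; the hashing scheme used to detect exact duplicates is an assumption of this sketch, and language filtering is omitted because the detection method is not described.

```python
import hashlib

MIN_CHARS, MAX_CHARS = 250, 25_000  # length bounds stated in the text

def filter_articles(articles):
    """Keep articles within the length bounds and drop exact duplicates."""
    seen, kept = set(), []
    for text in articles:
        if not MIN_CHARS <= len(text) <= MAX_CHARS:
            continue
        # Hash whitespace-normalized text so trivial spacing differences
        # do not hide a duplicate (an assumption of this sketch).
        digest = hashlib.sha256(" ".join(text.split()).encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(text)
    return kept

article = "x" * 300
print(len(filter_articles([article, article, "too short", "y" * 30_000])))  # 1
```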

Web scraping and initial cleaning

To extract relevant articles from the selected advocate and skeptic sources, we developed a domain-adaptive web crawling pipeline capable of handling diverse website structures. The process was tailored to ensure comprehensive retrieval while mitigating common web scraping challenges, including access restrictions, dynamic content rendering, and redundant link traversal.

To maximize coverage while minimizing noise, we employed a multi-layered extraction strategy (see Fig. 7 for visual representation of the data collection pipeline):

Domain-adaptive crawling rules to restrict extraction to subdomains containing relevant content, preventing retrieval of peripheral pages.
Dynamic link-following heuristics to identify and prioritize primary article pages while avoiding recursive pagination loops.
Content-aware filtering to systematically exclude non-article elements such as advertisements, navigation menus, and embedded media.

Given the structural variability across sources, we implemented targeted site-specific extraction rules. For example, some advocate websites exclusively published articles within dedicated subdirectories, necessitating precise subdomain filtering to avoid indexing press releases, event listings, or user-generated comments. Our approach ensured that only substantive editorial content was retained.

Once collected, articles underwent an initial normalization step to eliminate structural inconsistencies while preserving linguistic integrity. This included stripping HTML tags, decoding encoded characters, and standardizing whitespace. Unlike conventional NLP preprocessing pipelines that impose aggressive text normalization (e.g., lemmatization, stop-word removal, case-folding), we deliberately preserved capitalization, punctuation, and stylistic markers. This decision was driven by the need to maintain rhetorical and persuasive elements essential for downstream analysis. Additionally, non-textual artifacts such as tables, inline metadata, and formatting redundancies were systematically removed to enhance content uniformity.
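The light-touch normalization described above might look like the following sketch, which strips tags, decodes character entities, and standardizes whitespace while leaving capitalization and punctuation untouched. The regular-expression tag stripper is a simplification chosen for brevity; a full HTML parser would be more robust, and the pipeline's actual implementation is not shown in the text.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")
SPACE_RE = re.compile(r"[ \t\r\f\v]+")

def normalize(raw):
    """Strip tags, decode entities, tidy whitespace; keep case and punctuation."""
    text = TAG_RE.sub(" ", raw)
    text = html.unescape(text)
    text = SPACE_RE.sub(" ", text)
    # Trim each line and collapse runs of blank lines.
    lines = [line.strip() for line in text.split("\n")]
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()

print(normalize("<p>Climate &amp; Energy:  NOT   settled!</p>"))
# Climate & Energy: NOT settled!
```

Note that the emphatic capitalization in the example survives, which is the point of avoiding case-folding.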

Iterative noise reduction

Despite initial structural cleaning, many documents contained additional noise such as disclaimers, copyright statements, automated author attributions, and redundant editorial notices. Left unfiltered, these artifacts could distort downstream analyses, obscuring substantive discourse structures. We accordingly adopted a four-stage noise reduction pipeline that was iteratively applied to the data.

Preliminary clustering

To isolate documents dominated by noise (i.e., non-substantive content), we first applied unsupervised document clustering. Document embeddings were projected into a lower-dimensional space using Uniform Manifold Approximation and Projection (UMAP), then grouped via Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN). This procedure reliably uncovered clusters populated by purely promotional announcements, month-end highlight compilations, or event-related bulletins. Each cluster was inspected to verify that the material contributed little meaningful discourse, and we iteratively refined the clustering parameters to preserve legitimate articles. Through this process, documents that lacked any substantial argumentative content were eliminated, substantially reducing the volume of noise at the outset.

TF-IDF-based lexicon construction

Many articles still contained repetitive phrases that did not conform to obvious templates (such as advertisements for other articles, commonly repeated disclaimers, syndicated editorial statements, standardized fundraising appeals, and references to social media platforms). To locate these subtler forms of noise, we employed Term Frequency–Inverse Document Frequency (TF-IDF) analysis at the n-gram level. We extracted the top fifteen highly weighted n-grams for each cluster, focusing on phrases ranging from five to thirty tokens in length. Shorter sequences generally produced false positives due to their generic nature, whereas longer ones proved too variable to capture recurrent patterns. Through multiple iterations of manual validation, we compiled a lexicon of commonly repeated disclaimers. This lexicon was refined to exclude any genuinely argumentative expressions that might have been inadvertently flagged.
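One way to surface candidate boilerplate phrases is sketched below. It treats each cluster as a single pseudo-document and ranks word n-grams by a smoothed TF-IDF score; both choices are assumptions, since the text does not specify the weighting variant. The five-to-thirty token range and the top-fifteen cutoff follow the description above, while the function names and example sentences are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n_min=5, n_max=30):
    """Yield word n-grams with lengths from n_min to n_max."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def top_ngrams_per_cluster(clusters, top_k=15, n_min=5, n_max=30):
    """Rank n-grams in each cluster by a smoothed TF-IDF score.

    `clusters` maps a cluster id to a list of document strings. Each
    cluster is treated as one pseudo-document (an assumption here).
    """
    counts = {
        cid: Counter(g for doc in docs for g in ngrams(doc.split(), n_min, n_max))
        for cid, docs in clusters.items()
    }
    # Number of clusters in which each n-gram occurs at least once.
    cluster_freq = Counter(g for c in counts.values() for g in c)
    n_clusters = len(counts)
    result = {}
    for cid, c in counts.items():
        scored = {
            g: tf * (math.log((1 + n_clusters) / (1 + cluster_freq[g])) + 1)
            for g, tf in c.items()
        }
        result[cid] = sorted(scored, key=scored.get, reverse=True)[:top_k]
    return result

clusters = {
    0: ["please donate to support our independent climate reporting today",
        "please donate to support our independent climate reporting now"],
    1: ["sea ice extent fell sharply in the arctic this summer"],
}
print(top_ngrams_per_cluster(clusters, n_min=5, n_max=5)[0][:2])
```

The ranked phrases are only candidates; as the text notes, manual validation is still needed to keep genuinely argumentative expressions out of the lexicon.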

Lexicon-guided filtering

Building on the TF-IDF-derived lexicon, we then implemented a rule-based filtering process that merged three key strategies. First, we employed regular-expression patterns to detect the templated text recurring across multiple sources. In the skeptic corpus, these patterns were extended to remove bracketed citations and formal reference lists, where placeholders like “References:” or “Works Cited” signaled residual bibliographic material. Second, we leveraged named entity recognition (NER) to distinguish genuine references to individuals or organizations from formulaic attributions and closing remarks. This distinction was particularly relevant in advocate articles, where contributor credentials were appended at the end, and in skeptic blogs, which often contained highly personalized sign-offs by authors. Finally, we conducted a sentence boundary normalization step to eliminate punctuation artifacts, truncated fragments, and inconsistent spacing introduced by the filtering process.
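The regular-expression part of this step can be illustrated with a short sketch that removes bracketed numeric citations and trailing reference lists. The patterns below are hypothetical stand-ins for the study's lexicon, which is not reproduced in the text, and the NER-based handling of attributions and sign-offs is not covered.

```python
import re

# Hypothetical patterns; the actual lexicon is not reproduced in the text.
BRACKET_CITATION = re.compile(r"\s*\[\d+(?:\s*[,–-]\s*\d+)*\]")
REFERENCE_TAIL = re.compile(r"\n\s*(?:References:|Works Cited).*", re.DOTALL)

def strip_bibliography(text):
    """Remove a trailing reference list and inline bracketed citations."""
    text = REFERENCE_TAIL.sub("", text)
    text = BRACKET_CITATION.sub("", text)
    # Tidy spacing left behind by the removals.
    return re.sub(r"[ \t]{2,}", " ", text).strip()

sample = "Warming has paused [1, 2]. Models run hot [3].\nReferences:\n[1] A blog post"
print(strip_bibliography(sample))
# Warming has paused. Models run hot.
```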

Advanced embedding-based cleaning

Although our lexicon-guided filtering identified and removed many sources of repetitive text, certain “loosely-repetitive” segments persisted across multiple documents (for example, different journalists’ biographies). Problematically, these loosely-repetitive segments were influencing cluster formation. To address this, we represented each sentence as a 1024-dimensional vector using the Alibaba-NLP/gte-large-en-v1.5 model, which augments a transformer-based encoder with rotary position embeddings (RoPE) and gated linear units (GLU) and has demonstrated strong performance on a variety of NLP benchmarks. These vectors were processed in batches of up to 2000 articles at a time, with a model encoding batch size of 64 to manage computational load.

The resulting embeddings were indexed with FAISS²⁸ using a hierarchical navigable small world index, configured with 32 neighbors. We normalized each embedding to unit length and performed approximate nearest-neighbor searches under a cosine similarity metric, retrieving the 12 most similar sentences (k = 12) for each query. Any sentence that appeared in at least 10 distinct articles at a similarity of 0.95 or higher was flagged for further inspection.
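The flagging rule can be illustrated with an exact brute-force search in place of the approximate HNSW index. This is a sketch under stated assumptions: the function names are hypothetical, the exhaustive search replaces FAISS purely for readability, and counting the query sentence's own article toward the total is one possible reading of the rule.

```python
import math

def normalize(v):
    """Scale a vector to unit length so that dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def flag_repeated_sentences(embeddings, article_ids, k=12, min_sim=0.95, min_articles=10):
    """Return indices of sentences whose near-duplicates span >= min_articles articles."""
    unit = [normalize(v) for v in embeddings]
    flagged = []
    for i, query in enumerate(unit):
        sims = [(sum(a * b for a, b in zip(query, v)), j) for j, v in enumerate(unit)]
        top_k = sorted(sims, reverse=True)[:k]  # exact k nearest neighbours
        # The query itself is among its neighbours, so its own article is counted.
        articles = {article_ids[j] for sim, j in top_k if sim >= min_sim}
        if len(articles) >= min_articles:
            flagged.append(i)
    return flagged
```

With k = 12 and a minimum of 10 articles, at most two of the retrieved neighbours may fall below the similarity cutoff or share an article before a sentence escapes flagging.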

Manual review confirmed whether flagged sentences contained substantive argumentative content or represented non-informative boilerplate. Within the advocate corpus, such boilerplate often appeared near the ends of articles, making it feasible to target the final ten sentences in each piece. By contrast, the skeptic corpus featured a more diffuse distribution of noise, including in-text advertisements and atypical formatting scattered throughout entire documents. As a result, the embedding-based method yielded marked improvements for the advocate corpus but proved less effective on skeptic texts, whose artifacts were sparser and more dispersed. Nonetheless, it substantially reduced repetitive placeholders across both corpora and preserved the core argumentative material essential for subsequent analyses.

Dual-level perspective analysis framework

To rigorously explore the structure of the perspectives, we developed a dual-level analytical framework that integrates macro- and micro-level analyses. At the macro level, we examine entire documents to capture broad thematic trends and overarching discursive patterns. In contrast, our micro-level analysis dissects texts into semantically coherent segments (hereafter, chunks) to reveal the nuanced argumentative and discursive elements within each document. We apply clustering and topic modeling at the document level, and compute chunk-level linguistic and rhetorical features that are aggregated by document-level topics, enabling a multi-scale view that links overall discursive context with localized signals.

Adaptive segmentation for micro-level perspective extraction

Traditional segmentation techniques, which rely on fixed lengths or syntactic cues, often fail to capture the fluid nature of discursive progression. To address this limitation, we introduce an adaptive segmentation procedure inspired by Retrieval-Augmented Generation (RAG) methods. This approach prioritizes semantic coherence, dynamically identifying shifts in content and thereby providing a more natural decomposition of text.

The segmentation process begins by partitioning the document into sentences using a dependency-based parser, ensuring that syntactic boundaries are respected. Each sentence is subsequently encoded into a 1024-dimensional dense vector via a medium-scale transformer model (dunzhang/stella_en_400M_v5) and then fed to a dynamic sliding-window mechanism that forms candidate segments. Starting with a one-sentence context window, the process iteratively extends the window by computing the cosine similarity between the embedding of the current window’s last sentence and that of the next sentence. If this similarity meets or exceeds a predefined threshold, $T_{\textrm{sim}}$, the sentence is appended to the window. This continues until the similarity drops below the threshold, at which point the window is closed and a new one is started, ensuring that only semantically cohesive sentences are grouped together; each candidate window is then represented by the mean of its constituent embeddings.

Once candidate chunks are formed, segmentation decisions are made based on two criteria. First, the direct similarity threshold, $T_{\textrm{direct}}$, ensures that weakly related or unrelated chunks remain separate. If the mean embedding similarity between consecutive chunks falls below this threshold, a split is enforced to prevent conflating distinct ideas.

Next, a gradient-based segmentation step is performed to capture abrupt shifts that may not be detected by direct similarity alone. Gradients, defined as absolute differences between consecutive inter-chunk similarities, measure local fluctuations in semantic continuity. If a gradient exceeds a gradient percentile threshold, $T_{\textrm{grad}}$, a boundary is introduced to prevent the continuation of a segment through a sharp transition, even when overall similarity remains above $T_{\textrm{direct}}$. This step ensures that segmentation is sensitive to sudden changes in discourse that might otherwise be overlooked.

The three thresholds, $(T_{\textrm{sim}}, T_{\textrm{direct}}, T_{\textrm{grad}})$, collectively determine the granularity of segmentation, balancing coherence and differentiation. Algorithm 1 outlines the full procedure, Fig. 8 illustrates the process as a flow chart, and Figure S1 provides a visual representation of its application. The resulting chunks represent locally consistent building blocks for each document’s perspective, enhancing the granularity and interpretability of intra-document analysis.
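To make the interaction of the three thresholds concrete, the following sketch implements one plausible reading of the procedure described above. It is not Algorithm 1 itself: the function names, the nearest-rank percentile, the attribution of each gradient to the later of its two boundaries, and the rule that candidate windows are merged unless a split criterion fires are all assumptions of this example.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mean_vec(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def percentile(values, q):
    """Nearest-rank percentile for q in [0, 100] (an assumed convention)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

def segment(embs, t_sim, t_direct, t_grad):
    """Return chunks as lists of sentence indices."""
    if not embs:
        return []
    # Step 1: grow a window while the next sentence resembles the window's last one.
    windows = [[0]]
    for i in range(1, len(embs)):
        if cosine(embs[windows[-1][-1]], embs[i]) >= t_sim:
            windows[-1].append(i)
        else:
            windows.append([i])
    if len(windows) == 1:
        return windows
    # Step 2: similarity between the mean embeddings of consecutive windows.
    means = [mean_vec([embs[i] for i in w]) for w in windows]
    sims = [cosine(means[k], means[k + 1]) for k in range(len(means) - 1)]
    # Step 3: gradients are absolute differences of consecutive similarities.
    grads = [0.0] + [abs(sims[k] - sims[k - 1]) for k in range(1, len(sims))]
    cutoff = percentile(grads[1:], t_grad) if len(grads) > 1 else float("inf")
    # Step 4: split on low similarity or a sharp gradient; otherwise merge.
    chunks = [list(windows[0])]
    for k, window in enumerate(windows[1:]):
        if sims[k] < t_direct or grads[k] > cutoff:
            chunks.append(list(window))
        else:
            chunks[-1].extend(window)
    return chunks
```

Lowering `t_sim` produces longer initial windows, while `t_direct` and `t_grad` decide which of the resulting window boundaries survive as chunk boundaries.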

Segmentation optimization

Although qualitative inspection can guide reasonable guesses for $(T_{\textrm{sim}}, T_{\textrm{grad}}, T_{\textrm{direct}})$, we adopted a more systematic approach by optimizing these parameters via Bayesian search. Specifically, each candidate parameter set induced a distinct segmentation of the corpus. We then measured the average intra-segment cohesion, defined as the mean pairwise similarity among embeddings within each segment, and the inter-segment separation, computed by taking one minus the similarity of consecutive segments’ mean embeddings. Formally, for a document segmented into $\{U_1, \dots , U_m\}$, let

$$\begin{aligned} \textrm{Coh}(U)&= \frac{1}{m} \sum _{k=1}^m \textrm{mean} \Bigl ( \cos (\textbf{e}, \textbf{e}') : \textbf{e}, \textbf{e}' \in U_k \Bigr ), \\ \text {and} \quad \textrm{Sep}(U)&= \frac{1}{m-1} \sum _{k=1}^{m-1} \Bigl [ 1 - \cos (\mu _{U_k}, \mu _{U_{k+1}}) \Bigr ], \end{aligned}$$

where $\mu _{U_k}$ is the mean embedding of segment $U_k$. These metrics are combined in a single objective:

$$J(\theta ) \;=\; \alpha \,\textrm{Coh}(U) \;+\; \beta \,\textrm{Sep}(U), \quad \alpha + \beta = 1,$$

which rewards high within-segment consistency and clear separation between segments. A Gaussian process model iteratively proposes new parameter sets $\theta = (T_{\textrm{sim}}, T_{\textrm{grad}}, T_{\textrm{direct}})$ in order to maximize $J(\theta )$. This procedure converges to an optimal trade-off that balances cohesive segmentation with sufficient granularity to isolate distinct argumentative units.
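The objective can be evaluated directly from segment embeddings, as in the sketch below. Two edge cases are not specified by the formulas and are handled here by assumption: a single-sentence segment is given cohesion 1, and separation is set to 0 when a document has only one segment. Pairwise means are taken over distinct pairs.

```python
import math
from itertools import combinations

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mean_vec(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cohesion(segments):
    """Coh(U): average over segments of the mean pairwise cosine similarity."""
    scores = []
    for seg in segments:
        pairs = list(combinations(seg, 2))
        # Assumption: a one-sentence segment counts as perfectly cohesive.
        scores.append(sum(cosine(a, b) for a, b in pairs) / len(pairs) if pairs else 1.0)
    return sum(scores) / len(scores)

def separation(segments):
    """Sep(U): average of 1 - cos between consecutive segment means."""
    if len(segments) < 2:
        return 0.0  # Assumption: undefined for m = 1, treated as zero.
    mus = [mean_vec(seg) for seg in segments]
    gaps = [1.0 - cosine(mus[k], mus[k + 1]) for k in range(len(mus) - 1)]
    return sum(gaps) / len(gaps)

def objective(segments, alpha=0.5):
    """J = alpha * Coh + beta * Sep with alpha + beta = 1."""
    return alpha * cohesion(segments) + (1.0 - alpha) * separation(segments)
```

For example, splitting four sentences into two internally identical but mutually orthogonal segments gives J = 1, whereas lumping them into one segment gives cohesion 1/3 and J = 1/6 at alpha = 0.5.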

By integrating dynamic window formation with gradient-based splitting, forced separation for dissimilar windows, and Bayesian parameter tuning, our unique approach yielded semantically coherent segments.

Data representation

To produce document-level embeddings, we used a large-scale transformer-based model (SFR-Embedding-2_R from Salesforce) to generate 4096-dimensional vectors for each full article. Such embeddings can capture rich semantic detail across the breadth of an article; however, models of this size typically demand substantial GPU memory, motivating the quantization of parameters to half precision (16-bit). Research on reduced-precision computation (Micikevicius et al., 2018; Kalamkar et al., 2019) indicates that precision reduction often preserves predictive accuracy for established models, provided no further training is required. Although a large embedding dimensionality can be advantageous, it also runs into the “curse of dimensionality” (Bellman, 1961; Allaoui et al., 2020) when clustering in Euclidean space, where similarity metrics become less discriminative as dimensions increase. We mitigated this risk by applying Uniform Manifold Approximation and Projection (UMAP; McInnes et al., 2018; Aysaky & Mandala, 2021; Wang et al., 2021) to reduce each 4096-dimensional vector to a more tractable subspace. UMAP is noted for its robust preservation of local geometric structures. Recent work also shows that dimension-reduced embeddings improve clustering (Allaoui et al., 2020), and our preliminary evaluations confirmed improved interpretability and efficiency once the embeddings were mapped to a lower-dimensional manifold.

To produce chunk-level embeddings, we used a medium-scale transformer-based model (dunzhang/stella_en_400M_v5) that was benchmarked for strong performance on sentence- to paragraph-scale inputs. Each semantically segmented chunk was encoded as a 1024-dimensional vector, reflecting the model’s capacity to succinctly capture multi-sentence arguments. Since each corpus contained roughly 20,000 to 25,000 documents but yielded roughly 700,000 chunks once subdivided, we stored these chunk embeddings in a persistent vector database (Chroma) to ensure efficient retrieval and scalability. The curse of dimensionality remained relevant at this scale, so we again applied UMAP-based dimension reduction, down to 10 dimensions, balancing compactness and semantic fidelity.

Clustering with HDBSCAN

All dimensionality-reduced embeddings were clustered using HDBSCAN, an extension of the density-based DBSCAN algorithm that integrates hierarchical clustering (Campello, Moulavi, and Sander, 2013). In classic DBSCAN, users must manually select a single density threshold (i.e., the minimum density required for points to be considered part of a cluster). HDBSCAN, by contrast, evaluates multiple density thresholds to build a hierarchy of potential clusters. Clusters that do not remain stable under small perturbations are pruned from the final hierarchy (Aysaky and Mandala, 2021), enabling the algorithm to capture variable densities in different portions of the data. HDBSCAN also does not require an explicit number of clusters in advance, letting the data “speak for itself” with minimal assumptions (Stewart and Al-Khassaweneh, 2021).

To deploy HDBSCAN, we primarily controlled two parameters: $\text {min}\_\text {cluster}\_\text {size}$ (the minimum number of points to form a valid cluster) and $\text {min}\_\text {samples}$ (the number of neighbors to calculate local density). Because the corpora all revolve around climate change, we sought to discover relatively fine-grained clusters that reflect diverse discursive positions within a single broad topic. We further leveraged HDBSCAN’s hierarchical outputs (parent-child relationships among clusters) to understand how the data subdivided under stricter or looser density requirements. Points not meeting the density threshold were assigned the label $-1$, thus counted as noise rather than forced into weak clusters.

Hyperparameter optimization

Each embedding set had its own trade-offs in terms of dimensionality and cluster granularity. We therefore employed Bayesian optimization to systematically tune both our dimensionality-reduction and HDBSCAN parameters. This process was repeated separately for advocate vs. skeptic corpora, allowing each sub-dataset to converge on its own optimal set of hyperparameters.

Search Space and Pipeline

Dimensionality reduction was performed with UMAP, parameterized by $(n_{\text {neighbors}}, n_{\text {components}})$. Clustering was governed by the two main HDBSCAN parameters, $\text {min}\_\text {cluster}\_\text {size}$ and $\text {min}\_\text {samples}$. Additionally, two coefficients, $\alpha$ and $\gamma$, were used in our final objective function to penalize noise and balance the number of clusters discovered. This yielded a six-parameter search space:

$$\theta = \{ n_{\text {neighbors}},\, n_{\text {components}},\, \text {min}\_\text {cluster}\_\text {size},\, \text {min}\_\text {samples},\, \alpha ,\, \gamma \}.$$

For each candidate configuration $\theta$, the original embeddings $E$ were first reduced to $E_{\text {red}}$ via UMAP, then clustered with HDBSCAN to yield cluster labels $\mathcal {C}$. Over multiple iterations, a Gaussian process adaptively sampled new parameter points $\theta$ likely to improve clustering quality.

Evaluation metrics and objective function

We tracked three principal metrics on the cluster assignment $\mathcal {C}$. First, we measured the relative validity $V(\mathcal {C})$ reported by HDBSCAN, which reflects internal density-based consistency. Second, the noise ratio $R(\mathcal {C})$ was calculated by

$$R(\mathcal {C}) = \frac{|\{ i \,\mid \, \mathcal {C}_i = -1 \}|}{|E|},$$

reflecting the fraction of points labeled as noise. Third, we recorded the total number of clusters $N_c$, assessing whether a parameter set yielded excessive or insufficient granularity in cluster formation. We unified these into a single objective:

$$J(\theta ) = V(\mathcal {C}) + \gamma \,\log \bigl (1 + N_c\bigr ) \;-\; \alpha \, R(\mathcal {C}),$$

where $\alpha$ penalized excessive noise, $\gamma$ gently encouraged multiple clusters, and $\log (1 + N_c)$ moderated extreme fragmentation. The parameter vector $\theta ^*$ maximizing $J(\theta )$ defined the optimal UMAP and HDBSCAN settings for each sub-dataset.
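Given a label vector and a validity score, the objective reduces to a few lines. The sketch below assumes the natural logarithm and takes the validity $V(\mathcal{C})$ as a precomputed number rather than calling HDBSCAN; the function names are illustrative.

```python
import math

def noise_ratio(labels):
    """R(C): fraction of points labeled -1."""
    return sum(1 for label in labels if label == -1) / len(labels)

def n_clusters(labels):
    """N_c: number of distinct clusters, ignoring the noise label -1."""
    return len({label for label in labels if label != -1})

def clustering_objective(validity, labels, alpha, gamma):
    """J = V + gamma * log(1 + N_c) - alpha * R (natural log assumed)."""
    return (validity
            + gamma * math.log(1 + n_clusters(labels))
            - alpha * noise_ratio(labels))
```

As a worked case, eight points with labels (0, 0, 1, 1, -1, -1, 2, 2) have $R = 0.25$ and $N_c = 3$; with $V = 0.4$, $\alpha = 1$ and $\gamma = 0.1$ this gives $J = 0.4 + 0.1 \log 4 - 0.25 \approx 0.289$.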

Topic modeling with BERTopic

We employ BERTopic²⁹ to derive interpretable topic representations from clusters obtained via our pre-computed UMAP and HDBSCAN pipeline. In our workflow, each document is already assigned to a cluster $\mathcal {C}_i$ (with $\mathcal {C}_i = -1$ denoting noise), and BERTopic is used solely for topic extraction rather than for dimensionality reduction or clustering. Let the total number of clusters be $K$, ignoring noise (i.e., clusters with index $-1$).

Class-Based TF–IDF ($\mathcal{T}\mathcal{F}_c$). For each cluster $\mathcal {C}_i$ (where $\mathcal {C}_i \ne -1$), we compute a specialized term-weighting scheme, referred to as class-based TF–IDF and denoted as $\mathcal{T}\mathcal{F}_c$. In contrast to standard TF–IDF, which often measures term frequency across individual documents and inverse document frequency across the entire corpus, class-based TF–IDF treats each cluster as a “class” of documents. Formally, let $t$ be a term (token, unigram, or $n$-gram) and $\mathcal {C}_i$ be the $i$-th cluster. Define:

$$\text {TF}(t, \mathcal {C}_i) \;=\; \sum _{d \in \mathcal {C}_i} f_{t,d},$$

where $f_{t,d}$ is the frequency of term $t$ in document $d$. This $\text {TF}$ aggregates the counts of $t$ across all documents in the cluster $\mathcal {C}_i$. The inverse document frequency term is replaced with a notion of inverse class frequency:

$$\text {ICF}(t) \;=\; \log \,\Bigl (\frac{N}{1 + \sum _{j=1}^K \mathbb {1}\bigl [t \in \mathcal {C}_j\bigr ]} \Bigr ),$$

where $N$ is the total number of documents in the corpus, and $\sum _{j=1}^K \mathbb {1}\bigl [t \in \mathcal {C}_j\bigr ]$ denotes the number of clusters in which $t$ appears at least once. Finally, the class-based TF–IDF weight is:

$$\mathcal{T}\mathcal{F}_c(t, \mathcal {C}_i) \;=\; \underbrace{\frac{\text {TF}(t, \mathcal {C}_i)}{\sum _{w}\text {TF}(w, \mathcal {C}_i)}}_{\text {normalized term frequency in cluster}} \;\times \; \underbrace{\log \,\Bigl (\frac{N}{1 + \sum _{j=1}^K \mathbb {1}\bigl [t \in \mathcal {C}_j\bigr ]}\Bigr )}_{\text {inverse class frequency}}.$$

This formulation emphasizes terms that are frequent within $\mathcal {C}_i$ and exclusive to $\mathcal {C}_i$ relative to other clusters. The outcome is a sorted list of candidate keywords for each cluster, capturing its unique thematic fingerprint.
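The weighting defined above can be computed as follows. This sketch mirrors the formulas as written in this section rather than the internals of the BERTopic library; the function name is hypothetical, documents are assumed to be pre-tokenized, and $N$ is taken to be the number of documents passed in (noise documents excluded).

```python
import math
from collections import Counter

def class_tfidf(clusters):
    """clusters: dict mapping cluster id -> list of token lists (noise excluded).

    Returns a dict mapping cluster id -> {term: weight}.
    """
    # TF(t, C_i): counts of each term summed over the cluster's documents.
    tf = {c: Counter(tok for doc in docs for tok in doc) for c, docs in clusters.items()}
    n_docs = sum(len(docs) for docs in clusters.values())
    # Number of clusters in which each term appears at least once.
    class_freq = Counter(term for counts in tf.values() for term in counts)
    weights = {}
    for c, counts in tf.items():
        total = sum(counts.values())
        weights[c] = {
            term: (freq / total) * math.log(n_docs / (1 + class_freq[term]))
            for term, freq in counts.items()
        }
    return weights
```

A term that occurs in every cluster receives a small inverse class frequency, so it is ranked below terms of similar frequency that are confined to one cluster.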

MMR-Based Re-Ranking. After obtaining an initial set of top-scoring keywords via $\mathcal{T}\mathcal{F}_c$, we apply additional re-ranking mechanisms. First, part-of-speech–based filters remove generic function words (e.g., determiners, auxiliaries) and less informative tokens. Next, we employ Maximal Marginal Relevance (MMR)³⁰ to balance a keyword’s relevance to the cluster with its diversity relative to already-selected keywords. Concretely, let $R$ be the set of candidate keywords from the $\mathcal{T}\mathcal{F}_c$ stage, and let $S$ be the subset of keywords already chosen. We define:

$$\textrm{MMR}(k) \;=\; \lambda \,\textrm{Sim}(k, Q) \;-\; (1-\lambda ) \,\max _{k^\prime \in S} \textrm{Sim}(k, k^\prime ),$$

where $k \in R \setminus S$, $Q$ is a representation of the cluster (e.g., a centroid embedding), $\textrm{Sim}$ is a similarity function (often cosine similarity between embedding vectors), and $\lambda \in [0,1]$ is a hyperparameter controlling the trade-off between relevance ($\textrm{Sim}(k,Q)$) and diversity ($\max _{k^\prime \in S}\textrm{Sim}(k,k^\prime )$). At each iteration, we select:

$$k^* \;=\; \arg \max _{k \in R\setminus S}\; \textrm{MMR}(k),$$

and append $k^*$ to $S$. This process repeats until we meet a specified cutoff (e.g., the top $n$ keywords). By penalizing high overlap with already-selected terms, MMR ensures that the final keyword set is both representative of the cluster’s content and non-redundant, leading to more concise topic descriptors.
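The greedy selection loop is short enough to state in full. The sketch below assumes cosine similarity for $\textrm{Sim}$ and defines the redundancy term as zero while $S$ is empty; the function and argument names are illustrative.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mmr_select(candidates, cluster_vec, n, lam=0.5):
    """Greedy MMR over candidates, a dict mapping keyword -> embedding."""
    selected = []
    remaining = dict(candidates)
    while remaining and len(selected) < n:
        def score(k):
            relevance = cosine(remaining[k], cluster_vec)
            # Redundancy is the highest similarity to any already-selected keyword.
            redundancy = max(
                (cosine(remaining[k], candidates[s]) for s in selected), default=0.0
            )
            return lam * relevance - (1.0 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected
```

With a low $\lambda$ the second pick avoids near-synonyms of the first, while a high $\lambda$ simply follows relevance order.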

Aspect Extraction. While the above MMR-based approach yields a coherent set of keywords per cluster, we further enrich each cluster representation through the concept of aspects. Rather than converging on a single list of keywords, we produce three parallel views that highlight different semantic angles:

Main: A standard keyword-extraction procedure inspired by KeyBERT³¹, configured to return $\texttt {top\_n\_words} = 15$. KeyBERT leverages transformer-based language embeddings (e.g., Sentence-BERT) to compare candidate terms against the mean embedding of the entire cluster, thus prioritizing terms that are semantically central to the cluster’s content. This view is concise yet informative.
Aspect 1: A part-of-speech (POS) filter using a lightweight linguistic model (e.g., $\texttt {''en\_core\_web\_sm''}$) to identify and retain only core content words, such as nouns, verbs, and adjectives. By excluding high-frequency function words (articles, auxiliaries, etc.), Aspect 1 emphasizes the syntactic and stylistic traits of the cluster’s documents, often exposing more domain-specific or genre-defining language.
Aspect 2: A broader pipeline that first extracts a larger pool of up to 30 candidate keywords (again using a KeyBERT-inspired procedure) and then applies an MMR-based re-ranking step with a higher diversity parameter ($\texttt {diversity} = 0.8$). This mirrors the MMR framework described above, but emphasizes an even stronger penalty for overlap between candidate terms. The result is a diverse keyword set that captures a wider range of thematic nuances within the cluster.

In practice, this threefold representation (Main, Aspect 1, and Aspect 2) ensures that we do not collapse potentially important distinctions into a single keyword list. The Main set is optimized for concise interpretability, Aspect 1 foregrounds linguistic properties, and Aspect 2 explicitly leverages strong MMR diversity to highlight a broader semantic spread. Collectively, these aspects provide a more comprehensive view of each cluster and serve as richer inputs for downstream labeling and analysis.

Representative Documents. Finally, for each cluster $\mathcal {C}_i$, a set of representative documents is retrieved to supplement the purely keyword-based representations. We identify these exemplars by either:

Centroid Proximity: Ranking documents in $\mathcal {C}_i$ by their cosine similarity to the cluster centroid in the embedding space (UMAP or original sentence embeddings).
Keyword Overlap: Ranking documents by how many of the top keywords (from $\mathcal{T}\mathcal{F}_c$ or MMR) they contain, giving preference to documents that collectively span the cluster’s range of topics.

Formally, for a chosen similarity or overlap function $\textrm{RepScore}(d,\mathcal {C}_i)$, we take the top $m$ documents:

$$\text {RepDocs}(\mathcal {C}_i) \;=\; \arg \max _{\begin{array}{c} D \subseteq \mathcal {C}_i \\ |D| = m \end{array}} \sum _{d \in D} \textrm{RepScore}(d,\mathcal {C}_i).$$

These representative documents provide qualitative evidence to validate or refine the cluster’s topic description.
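Because the objective is a sum of per-document scores, the maximizing subset of size $m$ is simply the $m$ highest-scoring documents. The sketch below shows this for the centroid-proximity variant; the function name and the use of the unweighted mean as centroid are assumptions of the example.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def representative_docs(doc_embs, m):
    """Return the m document ids closest to the cluster centroid.

    doc_embs: dict mapping document id -> embedding for one cluster.
    """
    vectors = list(doc_embs.values())
    centroid = [sum(col) / len(vectors) for col in zip(*vectors)]
    ranked = sorted(doc_embs, key=lambda d: cosine(doc_embs[d], centroid), reverse=True)
    return ranked[:m]
```

The keyword-overlap variant would follow the same pattern with a different scoring function in the sort key.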

BERTopic Output and LLM-Based Labeling. Each cluster’s BERTopic output comprises multiple elements:

1. An automatically generated topic name based on the initial top keywords.
2. A quantitative representation of the topic: the weighted distribution of keywords via $\mathcal{T}\mathcal{F}_c$.
3. Two aspects ($\mathcal {A}_1$ and $\mathcal {A}_2$) capturing different semantic angles.
4. A collection of representative documents, $\text {RepDocs}(\mathcal {C}_i)$.

These elements are collated into a single output row for each cluster. The row is subsequently passed to a zero-shot large language model (LLM) pipeline, where the LLM is provided with:

the candidate topic name,
numerical keyword distribution from $\mathcal{T}\mathcal{F}_c$,
aspect-specific keyword sets,
representative documents,

and prompted to synthesize a final human-readable topic label.

This final label reconciles both statistical signals—derived from $\mathcal{T}\mathcal{F}_c$ and MMR-based re-ranking—and qualitative insights—from the representative documents. The convergence of multiple representations (quantitative keyword distributions, dual semantic aspects, and document exemplars) helps ensure that the topic labels are:

1. Methodologically grounded: built upon well-defined metrics ($\mathcal{T}\mathcal{F}_c$, MMR, aspect derivation).
2. Semantically coherent: robustly checked against actual text segments within the cluster.

Feature extraction

While clustering and topic modeling elucidate the thematic organization of both document- and chunk-level texts, they do not capture the full spectrum of rhetorical and stylistic features that shape climate-change perspectives. Persuasion in climate discourse is not solely a function of content but is critically mediated by how arguments are articulated. Rhetorical style and emotional appeal play pivotal roles in determining discourse impact, audience engagement, and ideological positioning³²,³³. Prior research in this domain has often relied on manual coding or deductive, dictionary-based approaches, which constrain scalability and limit interpretive flexibility.

To address these limitations, we adopt an exploratory, computational approach that integrates zero-shot chunk classification and sentence-based sentiment analysis. This methodology enables a data-driven examination of climate discourse without the imposition of rigid, a priori categories. In our zero-shot pipeline, we quantify stylistic features that capture the use of emotional appeals, problem-solution entailment, and populist features, all of which are mechanisms that underlie persuasive rhetoric. By extracting these attributes at multiple scales, from fine-grained semantic chunks to aggregated document-level patterns, we obtain a multidimensional perspective on both the content and the form of climate-change perspectives.

Chunk-Level Rhetorical Features. Given that semantic chunks (as derived from our adaptive segmentation procedure) are more internally coherent than full documents, we perform fine-grained classification of rhetorical elements at the chunk level.

Populist Feature Extraction. In parallel, we extract five populist rhetorical features and one frame-based feature using a state-of-the-art distilled large language model run via the Ollama library. The model, identified as DeepSeek-R1-Distill-Qwen-32B, is run with a temperature of zero to ensure reproducibility. A custom prompt instructs the model to analyze the text for populist markers, returning a JSON object that adheres to a strict schema. Specifically, the prompt defines the following features:

anti_elite: Negative references to elites, experts, or institutions.
people_centrism: References to “the people,” “us vs. them,” or “ordinary citizens.”
crisis_framing: Alarmist or exaggerated language that describes threats or emergencies.
emotive_appeals: Explicit emotional triggers (choosing from emotions: fear, disgust, happiness, sadness, anger, or surprise).
simplistic_solutions: Reductive or oversimplified problem–solution mappings.
framing_type: The overall framing as one or more of the following types: scientific/technical, pragmatic/economical, moral/ethical, or ideological/emotive.

The prompt requires that the model output only valid JSON, with no extraneous text, ensuring that each feature is reported as a key–value pair (e.g., a boolean indicator for presence and a concise evidence snippet).
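Replies of this kind are typically validated before use. The sketch below checks a reply against a hypothetical schema: the `present` and `evidence` field names, the list-valued `emotive_appeals` and `framing_type` entries, and the function name are assumptions for illustration, since the exact schema is not given in the text. Only the feature names and the permitted emotion and framing values come from the list above.

```python
import json

BOOLEAN_FEATURES = ["anti_elite", "people_centrism", "crisis_framing", "simplistic_solutions"]
EMOTIONS = {"fear", "disgust", "happiness", "sadness", "anger", "surprise"}
FRAMES = {"scientific/technical", "pragmatic/economical", "moral/ethical", "ideological/emotive"}

def _all_in(values, allowed):
    """True if values is a list of strings drawn from the allowed set."""
    return isinstance(values, list) and all(
        isinstance(v, str) and v in allowed for v in values
    )

def parse_reply(raw):
    """Return the parsed dict, or None if the reply violates the assumed schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # extraneous text around the JSON lands here
    if not isinstance(data, dict):
        return None
    if set(data) != set(BOOLEAN_FEATURES) | {"emotive_appeals", "framing_type"}:
        return None
    for name in BOOLEAN_FEATURES:
        entry = data[name]
        if not isinstance(entry, dict):
            return None
        # Hypothetical per-feature fields: a presence flag and an evidence snippet.
        if not isinstance(entry.get("present"), bool) or not isinstance(entry.get("evidence"), str):
            return None
    if not _all_in(data["emotive_appeals"], EMOTIONS):
        return None
    if not data["framing_type"] or not _all_in(data["framing_type"], FRAMES):
        return None
    return data
```

Rejecting malformed replies outright, rather than repairing them, keeps the downstream feature table free of partially parsed rows.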

Problem–Solution Classification. To determine whether a chunk primarily diagnoses a climate issue, prescribes a solution, or remains neutral, we utilize a specialized zero-shot entailment pipeline. In this pipeline, each chunk is processed using the model MoritzLaurer/deberta-v3-large-zeroshot-v2.0 within the Transformers Python library. The model is prompted with the hypothesis template:

$$\text {''This text presents a \{\} statement''}$$

and candidate labels [“problem”, “solution”, “neutral”]. This formulation enables the model to compute entailment probabilities for each label and thereby classify the discursive function of the text without requiring pre-specified training categories.

Emotion Distributions. In addition, we derive emotion distributions from each chunk using a RoBERTa-based transformer model fine-tuned on the GoEmotions dataset³⁴. This model returns the relative prevalence of 27 fine-grained emotions (e.g., admiration, amusement, anger, annoyance, and approval), or neutral, further enriching our characterization of the rhetorical tone.

Data availability

The advocate and skeptic datasets, and all other data necessary to reproduce our results are available at https://figshare.com/s/27727fefc7c22cc31b0d.

Code availability

All code for this paper is available at https://github.com/BubbleElixir/reading-the-climate-room.

References

Constantino, S. M. & Weber, E. U. Decision-making under the deep uncertainty of climate change: The psychological and political agency of narratives. Curr. Opin. Psychol. 42, 151–159 (2021).
Moser, S. Reflections on climate change communication research and practice in the second decade of the 21st century: What more is there to say?. WIREs Clim. Change. 7(3), 345–369. https://doi.org/10.1002/wcc.403 (2016).
Lamb, W. et al. Discourses of climate delay. Glob. Sustain. https://doi.org/10.1017/sus.2020.13 (2020).
Supran, G. & Oreskes, N. Assessing ExxonMobil’s climate change communications (1977–2014). Environ. Res. Lett. https://doi.org/10.1088/1748-9326/aa815f (2017).
Coan, T. G., Boussalis, C., Cook, J. & Nanko, M. O. Computer-assisted classification of contrarian claims about climate change. Sci. Rep. 11(1), 22320 (2021).
Grimmer, J., Roberts, M. & Stewart, B. Machine learning for social science: An agnostic approach. Annu. Rev. Polit. Sci. 24, 395–419. https://doi.org/10.1146/annurev-polisci-053119-015921 (2021).
Boussalis, C. & Coan, T. G. Text-mining the signals of climate change doubt. Glob. Environ. Change. 36, 89–100 (2016).
Grimmer, J. & Stewart, B. M. Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Polit. Anal. 21(3), 267–297 (2013).
Farrell, J. Corporate funding and ideological polarization about climate change. Proc. Natl. Acad. Sci. U. S. A. 113(1), 92–97 (2016).
Egger, R. & Yu, J. A topic modeling comparison between lda, nmf, top2vec, and bertopic to demystify Twitter posts. Front. Sociol. 7, 886498. https://doi.org/10.3389/fsoc.2022.886498 (2022).
Angelov, D. Top2vec: Distributed representations of topics. https://doi.org/10.48550/arXiv.2008.09470
Fiorino, D. J. Climate change and right-wing populism in the United States. Environ. Polit. 31(5), 801–819. https://doi.org/10.1080/09644016.2021.2018854 (2022).
Huber, R. A., Greussing, E. & Eberl, J.-M. From populism to climate scepticism: The role of institutional trust and attitudes towards science. Environ. Polit. 31(7), 1115–1138. https://doi.org/10.1080/09644016.2021.1978200 (2022).
Prasad, A. Anti-science misinformation and conspiracies: Covid–19, post-truth, and science & technology studies (sts). Sci. Technol. Soc. 27(1), 88–112. https://doi.org/10.1177/09717218211003413 (2022).
Marquardt, J. & Lederer, M. Politicizing climate change in times of populism: An introduction. Environ. Polit. 31(5), 735–754. https://doi.org/10.1080/09644016.2022.2083478 (2022).
Vihma, A., Reischl, G., Anderson, A.N & Berglund, S. Climate change and populism: Comparing the populist parties’ climate policies in denmark, finland, and sweden. Technical report, Finnish Institute of International Affairs (FIIA) (2020). https://fiia.fi/en/publication/climate-change-and-populism
Lerner, J. S. & Keltner, D. Beyond valence: Toward a model of emotion-specific influences on judgement and choice. Cogn. Emot. 14(4), 473–493. https://doi.org/10.1080/026999300402763 (2000).
Lerner, J. S. & Keltner, D. Fear, anger, and risk. J. Pers. Soc. Psychol. 81(1), 146–159. https://doi.org/10.1037/0022-3514.81.1.146 (2001).
Renström, E. A., Bäck, H. & Carroll, R. Threats, emotions, and affective polarization. Polit. Psychol. 44(6), 1337–1366. https://doi.org/10.1111/pops.12899 (2023).
Stephan, W. G., Ybarra, O. & Morrison, K. R. Intergroup threat theory. In Handbook of Prejudice, Stereotyping, and Discrimination (ed. Nelson, T. D.) 43–59 (Psychology Press, 2009). https://doi.org/10.4324/9781841697772.
Cosgrove, T. & Bahr, M. The language of conspiracy theories: Negative emotions and themes facilitate diffusion online. SAGE Open 14(4), 21582440241290412. https://doi.org/10.1177/21582440241290413 (2024).
Inbar, Y. & Pizarro, D. A. How disgust affects social judgments. In Advances in Experimental Social Psychology Vol. 65 (ed. Gawronski, B.) 109–166 (Academic Press, 2022). https://doi.org/10.1016/bs.aesp.2021.11.002.
Jacques, P. J., Dunlap, R. E. & Freeman, M. The organisation of denial: Conservative think tanks and environmental scepticism. Environ. Polit. 17(3), 349–385 (2008).
Dunlap, R. E. & McCright, A. M. Climate change denial: Sources, actors and strategies In (ed. Lever-Tracy, C.) (2010).
Lück, J., Wessler, H., Wozniak, A. & Lycarião, D. Counterbalancing global media frames with nationally colored narratives: A comparative study of news narratives and news framing in the climate change coverage of five countries. Journalism 19(12), 1635–1656. https://doi.org/10.1177/1464884916680372 (2018).
Dannemann, H. Climate obstruction: How denial, delay and inaction are heating the planet: By Kristoffer Ekberg, Bernhard Forchtner, Martin Hultman and Kirsti M. Jylhä. Environ. Polit. 32(6), 1104–1106. https://doi.org/10.1080/09644016.2023.2215659 (2023).
Cunningham, C., Foxcroft, C. & Sauntson, H. The divergent discourses of activists and politicians in the climate change debate: An ecolinguistic corpus analysis. Language and Ecology (2022)
Douze, M., Guzhva, A., Deng, C., Johnson, J., Szilvasy, G., Mazaré, P.-E., Lomeli, M., Hosseini, L. & Jégou, H. The Faiss Library (2024). https://doi.org/10.48550/arXiv.2401.08281 .
Grootendorst, M. Bertopic: Neural topic modeling with a class-based tf-idf procedure. arXiv preprint arXiv:2203.05794 (2022)
Carbonell, J. & Goldstein, J. The use of mmr, diversity-based reranking for reordering documents and producing summaries. In: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 335–336 (1998)
Grootendorst, M. KeyBERT: Minimal keyword extraction with BERT. Zenodo https://doi.org/10.5281/zenodo.4461265 (2020).
Bushell, S., Buisson, G. S., Workman, M. & Colley, T. Strategic narratives in climate change: Towards a unifying narrative to address the action gap on climate change. Energy Res. Soc. Sci. 28, 39–49. https://doi.org/10.1016/j.erss.2017.04.001 (2017).
Schäfer, M. Online communication on climate change and climate politics: A literature review. WIREs Clim. Change 3(6), 527–543. https://doi.org/10.1002/wcc.191 (2012).
Demszky, D., Movshovitz-Attias, D., Ko, J., Cowen, A., Nemade, G. & Ravi, S. GoEmotions: A dataset of fine-grained emotions. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4040–4054 (2020). https://doi.org/10.18653/v1/2020.acl-main.372. https://aclanthology.org/2020.acl-main.372/

Funding

This work was supported by the Irish Research Council (Grant No. IRCLA/2022/3896).

Author information

Authors and Affiliations

School of Communications, Dublin City University, Dublin, Ireland
Lorin Sweeney, Rabhya Mehrotra, Fionna Saintraint, Robert A. Brennan & Jane Suiter

Authors

Lorin Sweeney
Rabhya Mehrotra
Fionna Saintraint
Robert A. Brennan
Jane Suiter

Contributions

L.S. conceived and designed the study, collected the data, implemented the code, carried out the analyses, created all figures, interpreted the results, and drafted the manuscript. R.M. and F.S. provided critical feedback on the methodological design and on the interpretation of the results, and reviewed the manuscript. R.B. provided additional interpretation of the findings from a social-psychology perspective and reviewed and edited the manuscript. J.S. supervised the project, approved the methodological design, and provided review and editing of the manuscript. All authors reviewed the manuscript, approved the final version, and agree to be accountable for all aspects of the work.

Corresponding author

Correspondence to Lorin Sweeney.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Information.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Sweeney, L., Mehrotra, R., Saintraint, F. et al. Reading the climate room through unsupervised analysis of unfiltered climate perspectives. Sci Rep 16, 14828 (2026). https://doi.org/10.1038/s41598-026-44553-x

Received: 17 June 2025
Accepted: 12 March 2026
Published: 24 March 2026
Version of record: 12 May 2026
DOI: https://doi.org/10.1038/s41598-026-44553-x