Abstract
Climate change is one of the defining challenges of our time, yet little is known about how early-career researchers contribute to this field through doctoral research. This study provides the first comprehensive mapping of climate change-related doctoral dissertations in Italy across all disciplines, spanning a 14-year period (2008–2021). Doctoral dissertations offer a unique lens into the formative stages of scientific inquiry, where new ideas, methods, and agendas take shape. Using a machine learning approach on a novel dataset of 74,394 dissertations, we identify climate-related dissertations and analyze their thematic, disciplinary, and geographical distribution, highlighting emerging research trends in areas such as energy transition, biodiversity conservation, and extreme weather events. While technical disciplines dominate among English-language dissertations, those written in Italian reveal a more balanced disciplinary landscape, with a stronger presence of the social sciences and humanities, although these remain underrepresented overall. Climate-related research spans a variety of topics, but regional variation also emerges: water in the North, energy in the Centre and South, and governance in the Islands. This study marks an important step toward recognizing doctoral research as a strategic asset in building resilient climate knowledge systems and guiding long-term policy planning.
Introduction
In recent decades, climate change has been widely recognized as a prominent indicator of the ongoing environmental crisis1. In response to this global challenge, the international institutional community has taken significant steps to mitigate its impact, exemplified by the landmark 2015 Paris Agreement, where representatives from 195 countries agreed on binding commitments to reduce greenhouse gas emissions. Given the global scale and urgency of the issue, the United Nations has identified climate change as one of the most pressing challenges of our time2. In this context, analyzing scientific production on climate change is crucial to identifying research trends, emerging themes, and disciplinary contributions. Doctoral dissertations represent a significant and often underexplored source of insight into academic research priorities. Unlike journal articles, which reflect consolidated knowledge, doctoral dissertations capture the early stages of scientific inquiry and innovation. They provide a window into how early-career researchers engage with climate change, revealing the methodologies they employ, the disciplinary perspectives they adopt, and the emerging trends among future generations of researchers. This is a timely matter in Italy, where interest in the topic is driven by growing awareness of climate-related risks3 and by its inclusion in national political debates4.
A recent special issue led by Biancalana et al.5 demonstrates how climate change is emerging as a topic of political contestation in Italy, along with a growing interest in academic research. We aim to provide a comprehensive analysis of Italian doctoral dissertations on climate change published between 2008 and 2021. Specifically, our investigation identifies the main topics addressed in these dissertations, shedding light on the key research areas that have shaped climate-related academic discourse in Italy. Furthermore, we trace the temporal evolution of climate change research in Italian academia, identifying significant trends and interests. By examining the disciplinary distribution of dissertations through the Italian scientific classification system (Scientific-Disciplinary Sector - SSD), we evaluate the contributions of different academic fields to climate change research and identify the main orientations. Additionally, we analyse the geographical distribution of doctoral research on climate change, mapping topics across macro-regions.
The methodology involved expert researchers manually labelling a subset of doctoral dissertations as climate change-related or unrelated, based on their titles and abstracts in both Italian and English. This manually classified dataset was then used to train a supervised machine learning model, optimizing predictive performance through text mining techniques and cross-validation. Once the full corpus of climate change-related dissertations was identified, we conducted topic modelling and sentiment analysis to examine the key themes addressed in these dissertations. Sentiment analysis is a natural language processing (NLP) technique used to identify and classify the emotional tone of a text, typically as positive, negative, or neutral. It relies on computational methods, including machine learning and linguistic rules, to evaluate the sentiment expressed through words and phrases6. While various sentiment analysis approaches exist, the method we applied has been adapted specifically to suit the characteristics and context of our dissertation dataset, allowing for a more complex analysis of whether the texts primarily propose solutions, highlight challenges, or adopt a neutral perspective on climate-related issues. The trained topic model was subsequently applied to the entire dataset of Italian doctoral dissertations, enabling a comprehensive and automated mapping of climate change research across disciplines and macro-regions in Italy.
This study aligns with previous research exploring the representation of climate change in scientific literature7,8,9. Additionally, it contributes to the broader discussion on the role of various scientific disciplines in climate change research, particularly regarding the contribution of the social sciences, which are often underrepresented in climate-related studies compared to the natural sciences10. The complexity of climate change necessitates interdisciplinary collaboration among the natural sciences, social sciences and humanities, highlighting the need to reframe research questions and methodologies to foster better collaboration and understanding among disciplines11. To this end, this study examines the different meanings associated with climate change across disciplines and the degree of interest each discipline shows in the topic. By highlighting these differences, our work underscores the need for a more interdisciplinary approach, integrating perspectives from different disciplines to develop comprehensive strategies for addressing climate challenges.
Theoretical framework
Understanding the structure and evolution of climate change research requires a well-defined conceptual framework that integrates insights from bibliometric analysis and text mining. By systematically examining the thematic composition and disciplinary distribution of doctoral dissertations, this study aligns with a broader research tradition that seeks to map scientific knowledge and identify emerging trends. Numerous studies have explored specific subfields within climate change research. For example, researchers have analyzed climate change vulnerability12, the climate change controversy13, and adaptation strategies14. While valuable, these focused analyses provide a fragmented view of the overall research landscape. Several studies have provided comprehensive analyses of climate change research, highlighting key trends and developments in the field. Schwechheimer and Winterhager15 identified emerging research areas by examining publications in the Science Citation Index, while Li et al.16 assessed global climate change studies, focusing on patterns, trends, and methodologies. Haunschild et al.17 analyzed climate change literature, examining publications, research subfields, and the geographical distribution of research. Additionally, they assessed the citation impact to evaluate the influence and visibility of climate-related studies within the scientific community. Similarly, Sangam and Savitha18 examined climate change and global warming literature, emphasizing growth trends and research collaborations. More recently, Fu and Waltman19 conducted a large-scale bibliometric analysis of climate change publications from 2001 to 2018. Their findings reveal a shift in research focus from studying the climate system to addressing climate technologies and policies, such as energy efficiency and environmental legislation. 
They also highlight an imbalance in scientific production between developed and developing countries, and emphasize the influence of geography and national strategies in shaping research priorities worldwide. In line with these global patterns, Reale and Spinello20 analyzed the Italian R&D funding landscape and show that government funding has offered limited and short-term support for climate change research in academia.
Despite extensive research on global climate change, the role of doctoral dissertations in this field remains underexamined. A systematic analysis is essential to uncover emerging research trends, methodologies, and disciplinary contributions within climate change studies. Doctoral dissertations represent a significant body of research, often introducing novel perspectives and addressing emerging topics with a level of depth that may exceed that of journal articles. Additionally, they provide valuable contextual insights, helping to map the orientation of climate change research across disciplines. As a key component of the national research ecosystem, dissertations reflect the training, priorities, and intellectual contributions of early-career researchers21. This study positions doctoral research within the broader framework of national knowledge and innovation systems, emphasizing its role in shaping knowledge production and scientific advancement22,23. Beyond generating new insights, dissertations influence the evolution of disciplinary contributions, reflecting shifts in scientific priorities. A key aspect of this study is the in-depth analysis of the Scientific-Disciplinary Sectors (SSDs) contributing to climate change research. Rather than focusing solely on broad disciplinary categories, this approach enables a more granular understanding of the specialized fields addressing climate-related challenges. Identifying these areas of expertise is crucial for mapping the diversity of scientific knowledge involved and highlighting potential gaps or underrepresented disciplines in the national research landscape. In recent years, there has been growing recognition of the need to strengthen the role of social sciences in climate change research. This perspective is essential for understanding the anthropogenic drivers of climate change, the social dimensions of its impacts and vulnerabilities, and the mechanisms for coordinated responses to climate threats.
However, as Hiltner10 highlights, climate change has received limited attention within the social sciences, with disciplines such as sociology underrepresented in climate research and global assessments compared to the natural sciences. Through a comprehensive literature review, Hiltner underscores the need for greater engagement from social scientists in this field, arguing that their involvement is crucial for fostering a more socially informed understanding of and response to this global challenge. Nevertheless, some notable contributions from the social sciences do exist. Key research areas have included the interplay between climate vulnerability and adaptation policies24 and the role of social movements in shaping climate policies25.
This study also examines the territorial dimension of climate change research, recognizing that climate impacts and adaptation strategies are often highly context-dependent, shaped by geographic, environmental, and socio-economic conditions26,27. Understanding the extent to which the research interests of early-career scholars in Italian universities align with macro-regional contexts is essential for identifying potential connections between research priorities and regional characteristics. For example, coastal regions may prioritize studies on sea-level rise, coastal resilience, and marine ecosystem conservation, whereas mountainous areas might focus on climate change effects on biodiversity, water resource management, and natural hazard mitigation. Similarly, agricultural regions are likely to emphasize climate-smart farming techniques, sustainable irrigation methods, and the economic impacts of extreme weather events on rural livelihoods, while urban areas may center on sustainable city planning, air quality, and municipal climate adaptation policies. Examining these territorial variations provides insights into how research aligns with regional needs, ultimately contributing to the development of more effective policy solutions.
Building on existing literature, this study explores key questions regarding the thematic, disciplinary, and geographical dimensions of climate change doctoral research in Italy. Specifically, we address the following research questions:
i) What are the main research topics addressed in Italian doctoral dissertations on climate change between 2008 and 2021?
ii) How is climate change research distributed across different academic disciplines in Italy (as defined by SSDs), and which disciplines are most actively contributing to the field? Are certain scientific domains over- or under-represented in climate change research?
iii) Do specific macro-regions focus on climate change issues that are particularly relevant to their local context?
Results
In Table 1, we compare the performance of several supervised learning models for classifying climate change-related doctoral dissertations in both English and Italian. The models include LinearSVC, Logistic Regression, GBM, Random Forest, Naïve Bayes, and MLP, each evaluated with and without the ROSE balancing technique. Accuracy metrics for each model are presented in the table. Given the variability in linguistic structure and content richness between English and Italian dissertations, we aim to assess how different models generalize across languages and whether oversampling influences classification accuracy.
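To illustrate what balancing a training set involves, the following minimal sketch uses plain random oversampling with replacement. This is a simpler stand-in and not the ROSE procedure itself, which generates synthetic examples. The function name `oversample_minority`, the toy texts and labels, and the seed are assumptions made for illustration only.

```python
import random
from collections import Counter

def oversample_minority(texts, labels, seed=0):
    """Duplicate randomly chosen items of smaller classes until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    # Every class is grown to the size of the largest class.
    target = max(len(items) for items in by_class.values())
    out_texts, out_labels = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels

# Toy, hypothetical training set: 1 = climate-related, 0 = unrelated.
texts = ["glacier retreat", "roman law", "medieval poetry", "protein folding", "number theory"]
labels = [1, 0, 0, 0, 0]
balanced_texts, balanced_labels = oversample_minority(texts, labels)
class_counts = Counter(balanced_labels)
print(class_counts)
```

In a cross-validated setting, resampling of this kind is normally applied to the training folds only, so that the evaluation folds keep the original class proportions.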
LinearSVC (both weighted and with ROSE) performs consistently well across both languages, achieving the highest accuracy for Italian dissertations (0.97) and an accuracy of 0.95 for English dissertations. Logistic Regression performs similarly, with accuracy scores lower by just 0.01 (rounded up to 0.97), highlighting its robustness as a baseline model. These results indicate that LinearSVC effectively captures key discriminative features across both languages, making it the most reliable model for classifying Italian dissertations. For English dissertations, however, Random Forest emerges as the best-performing model, achieving an accuracy of 0.96, although its performance on Italian dissertations is lower (0.89).
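The per-language model choice described above can be made explicit with a short sketch. It uses only the four accuracy values quoted in the text. The dictionary layout and the helper name `best_model` are illustrative assumptions, not part of the original pipeline.

```python
# Accuracy values quoted in the text; only these four are used here.
accuracy = {
    "LinearSVC": {"Italian": 0.97, "English": 0.95},
    "Random Forest": {"Italian": 0.89, "English": 0.96},
}

def best_model(language):
    """Return the model name with the highest reported accuracy for one language."""
    return max(accuracy, key=lambda model: accuracy[model][language])

best_by_language = {lang: best_model(lang) for lang in ("Italian", "English")}
print(best_by_language)
```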
Using the optimized Random Forest model for English dissertations and LinearSVC for Italian dissertations, we then applied the classification out-of-sample to the entire corpus of Italian doctoral dissertations, systematically mapping climate change research across disciplines and geographical areas. As a result, we identified a total of 1,178 PhD dissertations written in English addressing climate change (11.1% of English-language dissertations), compared to 318 in Italian (18.1% of Italian-language dissertations). We ran topic modelling separately for the two languages, and sentiment analysis only for English-language dissertations, because pretrained sentiment dictionaries were available only for English.
Having identified the climate change-related dissertations with our classification models, we explored how their disciplinary patterns have evolved over time (Fig. 1). We analyzed the relative distribution of the ERC (European Research Council) domains, namely Social Sciences and Humanities (SSH), Physical Sciences and Engineering (PE), and Life Sciences (LS), from 2008 to 2021. Figure 1 presents the annual percentage shares of the three ERC domains, separately for dissertations written in Italian and English, along with an aggregated profile.
The analysis reveals a marked prevalence of SSH in Italian-language dissertations, where Social Sciences and Humanities consistently account for a larger share of climate change-related research compared to PE and LS. In contrast, English-language dissertations show a clear predominance of Physical Sciences and Engineering (PE), followed by Life Sciences (LS), while SSH is markedly underrepresented. These findings suggest that language is not merely a medium of communication, but also a structural dimension of how climate change research is framed in Italian doctoral education. Italian-language dissertations appear to engage more with climate change through social, political, and cultural lenses, possibly reflecting nationally oriented research agendas, while English-language dissertations are more rooted in technical and scientific approaches, likely oriented toward international academic audiences and journals.
Italian language dissertations
Figure 2 presents the frequency distribution of the most common unigrams, bigrams, and trigrams identified in the corpus of PhD dissertations related to climate change. The layout of the figure is structured to allow a hierarchical interpretation of lexical patterns, with single-word terms (1-grams) positioned at the top left, two-word combinations (2-grams) at the top right, and three-word sequences (3-grams) centred below. These keywords provide a useful representation of linguistic structures in the text, as they highlight recurrent patterns in the corpus.
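As a minimal sketch of how such n-gram frequencies can be obtained, the snippet below counts contiguous token sequences of length one, two and three. It assumes the abstracts have already been tokenized, lemmatized and stripped of stopwords. The function name `ngram_counts` and the three toy documents are hypothetical and not taken from the corpus.

```python
from collections import Counter

def ngram_counts(token_lists, n):
    """Count contiguous n-token sequences across a list of tokenized documents."""
    counts = Counter()
    for tokens in token_lists:
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

# Hypothetical, already lemmatized and stopword-filtered abstracts.
docs = [
    ["adattamento", "cambiamento", "climatico", "urbano"],
    ["cambiamento", "climatico", "efficienza", "energetica"],
    ["efficienza", "energetica", "edificio"],
]
unigrams = ngram_counts(docs, 1)
bigrams = ngram_counts(docs, 2)
trigrams = ngram_counts(docs, 3)
print(bigrams.most_common(2))
```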
The most frequently occurring unigrams include ambientale (environmental), urbano (urban), processo (process), and energetico (energy-related). This highlights a strong emphasis on environmental and urban sustainability, as well as on processes related to energy systems and economic aspects (economico). The presence of words such as strumento (tool), valutazione (evaluation), and politico (political) suggests a methodological and governance-related component in climate change research. The bigrams provide a more contextualized understanding of key research themes. The most frequent phrase, cambiamento climatico (climate change), confirms the centrality of the topic. Other high-frequency bigrams include policy-related terms such as processo decisionale (decision-making process) and sostenibilità ambientale (environmental sustainability), along with technical aspects like efficienza energetica (energy efficiency) and consumo suolo (land consumption). The presence of pinus nigra and spazio urbano (urban space) suggests a research focus on ecosystem and urban studies. The trigram distribution further refines these themes, showing more specialized phrases such as emissioni gas serra (greenhouse gas emissions) and adattamento cambiamento climatico (climate change adaptation), indicating a research focus on both mitigation and adaptation strategies. Additionally, the keyword sistema edificio impianto (building system) suggests applications in architecture and urban planning.
The lexical patterns revealed in this analysis show the interdisciplinary nature of climate change research, spanning from energy transition and environmental impact assessment to urban studies and governance. The presence of both mitigation-related (e.g., efficienza energetica, sostenibilità ambientale) and adaptation-focused (e.g., adattamento cambiamento climatico) terms highlights the dual approach that characterizes climate-related research. Moreover, the strong occurrence of decision-making and evaluation-related terminology suggests an interest in evidence-based policy frameworks.
Figure 3 shows the yearly trends in log-scaled relative frequencies of four selected climate-related keywords in Italian PhD abstracts (2008–2022): Ambientale (Environmental), Urbano (Urban), Cambiamento Climatico (Climate Change) and Emissione Gas Serra (Greenhouse Gas Emission). Values are normalized by the total keyword frequency in each year.
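The normalization can be sketched as follows: for each year, the count of a keyword is divided by the sum of all keyword counts in that year, and the logarithm of the ratio is taken. The base-10 logarithm, the function name `log_relative_frequency` and the counts below are assumptions for illustration; the source does not state the base used, and the counts are not taken from the dataset.

```python
import math

# Hypothetical keyword counts per year (not taken from the dataset).
counts_by_year = {
    2019: {"ambientale": 30, "urbano": 20, "cambiamento climatico": 10},
    2020: {"ambientale": 25, "urbano": 15, "cambiamento climatico": 10},
}

def log_relative_frequency(counts_by_year, keyword):
    """Per year: log10 of the keyword count divided by the sum of all keyword counts."""
    series = {}
    for year, counts in sorted(counts_by_year.items()):
        total = sum(counts.values())
        count = counts.get(keyword, 0)
        # A zero count has no finite logarithm, so it is recorded as None.
        series[year] = math.log10(count / total) if count > 0 and total > 0 else None
    return series

series = log_relative_frequency(counts_by_year, "cambiamento climatico")
missing = log_relative_frequency(counts_by_year, "emissione gas serra")
print(series)
```

Because the keyword count is identical in both toy years while the yearly total shrinks, the relative frequency rises from 2019 to 2020. This is why normalized values, rather than raw counts, are compared across years.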
The term Ambientale (Environmental) has remained consistently present across the years, with minor fluctuations. This suggests that environmental concerns have been a stable component of Italian climate-related PhD research, with no drastic peaks or declines in attention. The term Urbano (Urban) exhibits a fluctuating but consistently high presence, indicating a sustained focus on urban sustainability, planning, and climate resilience. The curve for Cambiamento Climatico (Climate Change) shows a clear upward trend, with notable peaks around 2012, 2015, 2018 and 2021. This suggests an increasing recognition of climate change as a major research topic, potentially aligned with key international climate agreements (e.g., the Paris Agreement in 2015). The trajectory of Emissione Gas Serra (Greenhouse Gas Emission) reveals alternating periods of high and low emphasis, with particularly low values observed in 2012, 2016 and 2019.
Topic modelling
Topic analysis was conducted using two complementary approaches: a model based on Bidirectional Encoder Representations from Transformers (BERT) and K-means clustering applied to document embeddings. Specifically, we used ClimateBERT, a domain-specific language model fine-tuned on climate-related texts28, to extract contextual semantic representations of PhD dissertation abstracts. This approach enables the automatic identification and structuring of emerging climate-related topics.
In parallel, we applied the K-means clustering algorithm to the document embeddings to group abstracts based on semantic similarity, providing a data-driven categorization of research themes. The visualization in Fig. 4 represents the distribution of topics identified by BERT, where each point corresponds to a document and the clustering was determined by semantic similarity. The algorithm automatically assigns optimal clusters by detecting latent structures in the data, without the need for predefined categories. This approach ensures a data-driven categorization of climate change doctoral dissertations in Italian.
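For readers unfamiliar with K-means, the sketch below implements the basic assignment and update loop (Lloyd's algorithm) on toy two-dimensional vectors that stand in for document embeddings. It is a simplified illustration, not the implementation used in this study. The deterministic initialization from the first k points, the function name `kmeans` and the toy vectors are assumptions.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points serve as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments, centroids

# Hypothetical 2-D stand-ins for abstract embeddings (real ones have many more dimensions).
embeddings = [(0.1, 0.2), (0.9, 0.8), (0.2, 0.1), (0.8, 0.9), (0.15, 0.15), (0.85, 0.85)]
cluster_ids, centers = kmeans(embeddings, k=2)
print(cluster_ids)
```

Library implementations typically add randomized or k-means++ initialization and multiple restarts, which matter on real embeddings where the result depends on the starting centroids.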
The visualization reveals distinct thematic areas emerging from the data. The cluster labels were initially in Italian and then translated into English for visualization purposes. In the upper left region, a cluster in pink represents research focused on climate change and biodiversity, covering topics such as environmental sustainability, biodiversity conservation and the broader impacts of climate change. On the right side of the graph, the blue cluster highlights studies related to energy, buildings and thermal systems, emphasizing advancements in energy efficiency, innovative building technologies and the optimization of thermal infrastructures. At the centre of the graph, a light blue cluster brings together research on food sustainability and ecological design, reflecting discussions on sustainable food production, green infrastructure and the integration of ecological principles in design practices. Meanwhile, in the bottom left section, a red cluster captures work on urban planning, landscape and urban development, focusing on spatial planning strategies, landscape architecture and approaches to enhancing urban sustainability. The spatial arrangement of these clusters suggests meaningful relationships among the topics, with some areas positioned closer together, indicating thematic overlaps and interdisciplinary connections. In particular, on the y-axis, we can observe the distance between more generic dimensions of climate change (Climate Change and Biodiversity) compared to applied themes of research related to urban landscapes (Urban Planning, Landscape, and City Design). Between these and the topic related to buildings and infrastructures (Energy, Buildings, and Thermal Systems), a cluster of documents is positioned at the intersection, focusing on the sustainable design of infrastructures (Sustainable Food and Ecological Design).
Table 2 presents the results of K-means clustering applied to Italian dissertations. To select the number of clusters, we used a perplexity-inspired heuristic adapted to the elbow method, which identified six as the optimal number of topics. This yields a well-balanced thematic categorization without excessive fragmentation or overlap (Fig. 15). Each topic, as summarized in Table 2, is characterized by a set of representative keywords that highlight its thematic focus. The representative keywords are identified by analysing the most dominant terms within each cluster and are further refined through feature importance analysis (using Random Forest) to ensure they are both frequent and thematically discriminative. The labels were assigned by an expert, taking into account the most important keywords identified in the text.
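The exact heuristic is not specified in the text, so the following is only a generic sketch of one common elbow criterion: after rescaling both axes to the unit interval, it picks the number of clusters whose point lies farthest below the straight line joining the first and last points of the curve. The function name `elbow_k` and the within-cluster dispersion values are hypothetical and were chosen so that the bend falls at six clusters.

```python
def elbow_k(inertia_by_k):
    """Pick the k whose (k, inertia) point lies farthest below the chord joining the endpoints.

    Assumes at least two values of k and a dispersion that is lower at the largest k
    than at the smallest one.
    """
    ks = sorted(inertia_by_k)
    k_min, k_max = ks[0], ks[-1]
    i_first, i_last = inertia_by_k[k_min], inertia_by_k[k_max]
    best_k, best_gap = k_min, 0.0
    for k in ks:
        # Rescale both axes to [0, 1] so that neither dominates the distance.
        x = (k - k_min) / (k_max - k_min)
        y = (inertia_by_k[k] - i_last) / (i_first - i_last)
        gap = 1.0 - x - y  # proportional to the distance from the chord x + y = 1
        if gap > best_gap:
            best_k, best_gap = k, gap
    return best_k

# Hypothetical within-cluster dispersion for k = 2..10 (not results from the study).
inertia = {2: 1000, 3: 820, 4: 660, 5: 520, 6: 400, 7: 380, 8: 365, 9: 352, 10: 340}
chosen_k = elbow_k(inertia)
print(chosen_k)
```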
One of the main themes that emerged is related to the environment, biodiversity and land use (Topic 0). This category includes discussions on ecosystem management and the impact of human activities on natural resources, as suggested by keywords such as cambiamento climatico, biodiversità, uso suolo and vegetazione. Another topic (Topic 1) revolves around sustainability, environmental management and the green economy, which captures studies on sustainable practices, policy frameworks and eco-friendly decision-making processes. A separate cluster (Topic 2) focuses on industrial thermal processes and energy optimization, where research explores energy efficiency and thermal process improvements in industrial contexts. Keywords such as efficienza energetica, consumo energetico, scudo termico, and flusso termico suggest a strong focus on heat transfer, energy conservation and carbon footprint reduction. Similarly, another topic (Topic 3) is centred on environmental assessment, water and soil quality and chemical analysis, highlighting research related to pollution monitoring, chemical analysis and risk evaluation in environmental systems. Urban and rural development, spatial planning, and governance represents another key area (Topic 4), covering research on urbanization, spatial organization and governance mechanisms. The discussion in this cluster addresses urban sustainability, rural development, and policy interventions, with keywords including urbano, paesaggio, città, sviluppo rurale, and spazio pubblico. Lastly, a cluster (Topic 5) dedicated to sustainable building design and energy performance captures studies on green construction technologies and energy-efficient building practices. This topic encompasses discussions on sustainable architecture, energy planning and the use of innovative materials.
The comparison between BERT-based topic modelling and K-Means clustering reveals that BERT tends to generate broader, semantically rich clusters, whereas K-Means is more effective in identifying specialized subcategories, providing a finer-grained segmentation of the dataset. This difference might be partially influenced by the number of clusters: in our case, BERTopic produced four clusters, while K-Means identified six. One of the most evident similarities is the identification of topics related to climate change, biodiversity and land use, which appear consistently in both models. The BERT-based approach groups these elements under a general category, highlighting the relationship between climate change and biodiversity conservation, while K-Means isolates land use aspects more explicitly, focusing on specific environmental and ecological dynamics. A similar alignment is observed in the classification of urban and rural development, spatial planning, and governance, where both methods recognize the importance of urban sustainability, governance structures and territorial policies. However, divergences emerge in how each method structures topics related to energy, sustainability and environmental management. While K-Means clustering differentiates between industrial thermal processes and energy optimization as a distinct category, BERT integrates energy topics into a broader cluster that includes buildings, thermal systems and sustainable construction. This indicates that BERT tends to form clusters that encompass multiple related concepts under a larger thematic umbrella, whereas K-Means is more inclined to separate topics into finer-grained subdomains. Another key difference is in the treatment of sustainability and environmental management. K-Means groups these aspects under a general sustainability and green economy category, emphasizing decision-making processes, certifications and impact assessments. 
In contrast, BERT identifies a more specific cluster focusing on sustainable food and ecological design, demonstrating its ability to recognize conceptual themes that might be embedded within broader sustainability discussions. Further discrepancies arise in the categorization of environmental assessment, water and soil quality and chemical analysis. K-Means clustering identifies this as a distinct topic, highlighting research related to pollution monitoring, water resource evaluation and chemical risk assessment. In contrast, BERT does not generate a separate cluster for these studies but instead incorporates environmental considerations into broader sustainability and urban planning discussions. This suggests that the approach of BERT prioritizes contextual relationships between research topics, while K-Means is more attuned to specific thematic divisions based on textual similarity.
The bar chart (Fig. 5) shows the distribution of abstracts across the topics identified by the K-Means algorithm.
Topic 4 (Urban and Rural Development, Spatial Planning, and Governance) has the highest number of abstracts, indicating strong research interest in urban sustainability and governance. Conversely, Topics 0 and 5 have the fewest abstracts, suggesting that studies on environment, biodiversity and land use, as well as sustainable building design and energy performance are less represented in the dataset.
Figure 6 illustrates the percentage distribution of topics across Scientific-Disciplinary Sectors (SSDs).
Topic 0 (Environment, Biodiversity and Land Use) is dominant in biological sciences (BIO/05, BIO/07), architectural and urban design (ICAR/14) and law (IUS/13), reflecting the focus of these fields on biodiversity conservation and land-use policies. Topic 1 (Sustainability, Environmental Management and Green Economy) is strongly represented in economics and policy-related disciplines (SECS-P/07). Topic 2 (Industrial Thermal Processes and Energy Optimization) is more prevalent in chemistry (CHIM/02). Topic 3 (Environmental Assessment, Water and Soil Quality, and Chemical Analysis) is well represented in chemistry (CHIM/06), biology (BIO/07) and geosciences (GEO/05). Topic 4 (Urban and Rural Development, Spatial Planning, and Governance) is mostly found in architecture and planning disciplines (ICAR/12, ICAR/21). Finally, Topic 5 (Sustainable Building Design and Energy Performance) is strongly linked to engineering and architecture (ING-IND/11, ICAR/14), particularly in sectors related to infrastructure development, energy-efficient building design and sustainable construction practices.
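A percentage distribution of this kind can be computed with a simple cross-tabulation, sketched below. The sketch assumes that shares are normalized within each SSD, which the text does not state explicitly. The function name `topic_shares_by_group` and the SSD and topic pairs are hypothetical and not drawn from the dataset.

```python
from collections import Counter, defaultdict

# Hypothetical (SSD code, topic id) pairs, one per dissertation.
records = [
    ("BIO/07", 0), ("BIO/07", 3), ("BIO/07", 0), ("BIO/07", 0),
    ("ICAR/21", 4), ("ICAR/21", 4),
    ("ING-IND/11", 5), ("ING-IND/11", 2),
]

def topic_shares_by_group(records):
    """For each group, return the percentage of its dissertations assigned to each topic."""
    counts = defaultdict(Counter)
    for group, topic in records:
        counts[group][topic] += 1
    shares = {}
    for group, topic_counts in counts.items():
        total = sum(topic_counts.values())
        shares[group] = {t: 100.0 * n / total for t, n in topic_counts.items()}
    return shares

shares = topic_shares_by_group(records)
print(shares["BIO/07"])
```

The same function applies unchanged when the grouping variable is a macro-region instead of an SSD code.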
Figure 7 shows the distribution of topics in Italian-language dissertations across the geographical areas of university affiliation in Italy.
Topic 0 (Environment, Biodiversity, and Land Use) is prominently represented in all regions except the South, where it is nearly absent. This suggests that research on biodiversity conservation and land-use policies is more concentrated in the Centre, Northeast and Northwest. Topic 2 (Industrial Thermal Processes and Energy Optimization) appears in the Northwest, reflecting a strong focus on energy efficiency and industrial applications in this macro-region. Similarly, Topic 3 (Environmental Assessment, Water and Soil Quality, and Chemical Analysis) is well represented in the Centre and Northwest, highlighting a focus on environmental monitoring and resource management in these areas. A significant share of dissertations in the South and Northwest is dedicated to Topic 4 (Urban and Rural Development, Spatial Planning and Governance), suggesting that studies on urban sustainability, governance and spatial planning are more prevalent in these regions. Topic 5 (Sustainable Building Design and Energy Performance) is mainly addressed in the Centre and Northeast of Italy.
English language dissertations
Figure 8 presents the frequency distribution of the most common unigrams, bigrams, and trigrams extracted from the keywords associated with PhD dissertations written in English on climate change.
The most frequently occurring unigrams include energy, process, change, water and plant. This suggests a strong focus on energy systems, environmental changes and resource management. Terms like production, effect and species indicate an interest in ecological dynamics and industrial processes, while development and application highlight technological and applied aspects of climate change research. The bigrams provide a deeper contextual understanding. The most frequent phrase, climate change, confirms the core focus of the research. Other notable bigrams include policy-related expressions such as environmental impact, long term and point view, alongside technical and industrial aspects like renewable energy, fuel cell and energy consumption. The presence of study area and supply chain suggests an interest in regional analysis and industrial sustainability. For the trigram distribution, key expressions such as greenhouse gas emission and impact climate change indicate a focus on climate mitigation strategies, while renewable energy source and energy efficiency measure emphasize technological solutions for sustainability. The presence of wastewater treatment plant and municipal solid waste reflects research on urban sustainability and waste management. Additionally, decision make process and climate change adaptation suggest an interest in governance, decision-making, and adaptation strategies.
Figure 9 illustrates the temporal evolution of the relative frequency of four key terms in English-language PhD dissertations from 2008 to 2020: Energy, Water, Climate Change, and Greenhouse Gas Emission. The y-axis, on a logarithmic scale, indicates the frequency of each term normalized by the total keyword occurrences in the respective year.
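The normalization described for Fig. 9 can be sketched in a few lines of Python. This is only an illustration of the stated definition (term count divided by total keyword occurrences in the same year); the function name and the toy records are invented here and are not taken from the study's code or data.

```python
import math
from collections import Counter

def relative_frequency(records, term):
    """Return {year: occurrences of term / total keyword occurrences that year}."""
    totals = Counter()
    hits = Counter()
    for year, keywords in records:
        totals[year] += len(keywords)
        hits[year] += sum(1 for k in keywords if k == term)
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}

# Toy (year, keyword list) records, invented for illustration only.
toy_records = [
    (2008, ["energy", "water", "energy", "climate change"]),
    (2009, ["energy", "plant", "water", "water", "species"]),
]

freq = relative_frequency(toy_records, "energy")
# A logarithmic y-axis corresponds to plotting log10 of these shares.
log_freq = {year: math.log10(value) for year, value in freq.items() if value > 0}
```

Years in which a term never occurs have no finite logarithm, so they are dropped from `log_freq` in this sketch; a plotting routine would need to handle such gaps explicitly.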
The term Energy shows relatively minor fluctuations over time, suggesting a continuous and sustained focus on energy-related research. This aligns with the long-standing importance of energy systems in addressing climate challenges. The trajectory of Water exhibits a sharp increase between 2008 and 2010, followed by a period of relative stabilization. This initial growth may reflect increasing awareness of climate-induced hydrological changes, water resource management and drought resilience. The Climate Change curve shows a trend with cyclical fluctuations. Notable peaks around 2008, 2014, 2016 and 2019 may correspond to major international climate summits.
The trend for Greenhouse Gas Emission exhibits periodic peaks and declines, with peaks in 2008 and 2016.
Figure 10 presents the topic clustering analysis based on BERT embeddings, illustrating the semantic relationships among different themes in climate change.
Figure 10 shows the results of BERTopic applied to climate-related research. In the top-right region, the corporate innovation and sustainability strategies cluster (cyan) emerges, reflecting research on business approaches to sustainability. Slightly below, the smart grids and renewable energy systems cluster (yellow) groups studies on energy transition and sustainable power generation. On the leftmost side, the climate risk assessment and disaster management topic (blue) is positioned, representing research on assessing and mitigating climate-related hazards.
Further down on the left, the air pollution and atmospheric monitoring cluster (green) appears, covering studies related to air quality assessment and environmental impact monitoring. Close to this, the advanced materials for clean energy and decarbonization cluster (orange) is present, focusing on innovative materials designed to enhance energy efficiency and reduce carbon emissions.
In the central-lower section, the biodiversity conservation and climate change impacts cluster (purple) is identified, gathering research on ecosystem resilience. In the bottom-left region, the sustainable transport and engine emissions reduction cluster (grey) groups documents addressing strategies to minimize the environmental impact of transportation through technological advancements.
Finally, on the right side, the sustainable building and energy efficiency cluster (brown) is located, representing research on green architecture, energy-efficient construction, and sustainable urban planning.
Table 3 shows the results of K-means clustering applied to the English dissertations. The perplexity analysis indicates that the optimal number of topics for this dataset is eight (Fig. 16). The most prevalent terms in each cluster were examined to identify representative keywords, which were then refined through feature importance analysis (using Random Forest) to ensure that they are both frequent and thematically discriminative. The labels were assigned directly by an expert on the basis of the most important keywords.
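One common way to read such a curve, not necessarily the procedure used in this study, is to look for the number of topics at which the slope changes most. The Python sketch below does this with a discrete second difference. The function name and the score values are invented for illustration and assume evenly spaced candidate values.

```python
def elbow(ks, scores):
    """Return the k whose discrete second difference of scores is largest in magnitude.

    Assumes ks are evenly spaced; the first and last points cannot be elbows.
    """
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(ks) - 1):
        bend = abs(scores[i + 1] - 2 * scores[i] + scores[i - 1])
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k

# Invented scores: flat-ish decline up to k=8, then a sharp rise.
candidate_ks = [4, 5, 6, 7, 8, 9, 10]
toy_scores = [100, 99, 98, 97, 96, 110, 130]
chosen_k = elbow(candidate_ks, toy_scores)
```

A heuristic of this kind only flags a candidate; the final choice would still need to be checked against the interpretability of the resulting topics.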
The topic modelling analysis identifies eight distinct research areas. The first topic (Topic 0) is Ecology and Plant Sciences, which includes studies on plant growth, species interactions and microbial communities. Closely related to this field, Agriculture and Food Production (Topic 1) emerges as another key area, focusing on agricultural management, food production and livestock farming. The dataset also highlights research in Microbiology and Biodiversity (Topic 2), which explores microbial ecosystems, species adaptation and the role of microorganisms in environmental processes. Keywords such as bacterial community, microbial community, planktonic foraminifer, cold adapt and alien species point to investigations into biodiversity at a microscopic level and its implications for ecosystem stability. Another critical research area is Climate Policy and Regulations (Topic 3), which captures discussions on environmental governance, sustainability policies and legal frameworks. The presence of terms like climate change, sustainability, environmental social, life cycle, policy, and competition law suggests a strong focus on regulatory approaches to climate mitigation and the intersection between environmental policies and socio-economic factors. Water resource management and hydrological studies are also well represented in Hydrology, Water Resources and Population Studies (Topic 4), where research examines the impact of climate change on water systems and population dynamics.
Energy-related topics are divided into two distinct clusters. Energy Performance and Management (Topic 5) addresses energy efficiency, consumption patterns and sustainable resource use. The presence of keywords such as energy consumption, energy performance, heat pump and water use highlights efforts to optimize energy efficiency and implement long-term sustainable solutions. Meanwhile, Energy, Building Design, and Infrastructure (Topic 6) focuses on the integration of energy-efficient technologies within urban planning and building construction. Terms such as building energy, residential building, smart grid, and synthetic polymer suggest an emphasis on sustainable architecture, smart infrastructure, and energy-efficient design. Finally, Biodiversity and Cryosphere (Topic 7) represents a specialized research area dedicated to species distribution, genetic variation and the impact of climate change on polar and glacial environments.
In K-Means topic modelling, climate change is distributed across multiple categories, including Climate Policy and Regulations, Hydrology and Water Resources, and Biodiversity and Cryosphere. In contrast, BERT clusters climate-related discussions into broader thematic areas, such as Climate Risk Assessment and Disaster Management, highlighting its ability to capture semantic relationships within a single cluster. Energy-related topics also differ in their segmentation. K-Means separates Energy Performance and Management from Energy, Building Design and Infrastructure, making a clear distinction between energy efficiency in systems and energy use in construction. BERT, however, combines these aspects into Smart Grids and Renewable Energy Systems and Sustainable Building and Energy Efficiency, suggesting a stronger emphasis on technological solutions and energy integration. The treatment of biodiversity and ecology also varies. K-Means distinguishes Ecology and Plant Sciences from Microbiology and Biodiversity, identifying microbiology as a separate topic. Conversely, BERT merges biodiversity aspects into Biodiversity Conservation and Climate Change Impacts, integrating plant, microbial, and species-level studies within a broader ecological framework. Additionally, corporate sustainability and innovation is more explicit in BERT, with a dedicated cluster (Corporate Innovation and Sustainability Strategies), whereas K-Means does not create a distinct topic for corporate strategies. Air quality and pollution studies are captured differently as well. K-Means clusters these topics under Hydrology, Water Resources and Population Studies, linking water resources with climate impacts. BERT, on the other hand, forms a separate category for Air Pollution and Atmospheric Monitoring, indicating a finer distinction between different environmental monitoring approaches. 
Finally, the transport and emissions reduction topic is more explicitly recognized in BERT (Sustainable Transport and Engine Emissions Reduction), whereas K-Means does not form a distinct category for transportation.
The bar chart (Fig. 11) illustrates the distribution of abstracts across the identified research topics.
Energy, Performance and Management (Topic 5) has the highest number of abstracts, indicating a strong focus on energy efficiency, consumption patterns and sustainable resource use. This suggests a significant research interest in optimizing energy consumption and developing long-term sustainable solutions. Similarly, Agriculture and Food Production (Topic 1) also has a high representation, reflecting the importance of sustainable agricultural practices and food systems in environmental research. Other well-represented topics include Ecology and Plant Sciences (Topic 0) and Energy, Building Design, and Infrastructure (Topic 6). In contrast, Microbiology and Biodiversity (Topic 2) and Climate Policy and Regulations (Topic 3) are less represented, suggesting that research in these areas is more specialized or less frequently addressed.
Figure 12 illustrates the distribution of topics across different Scientific Disciplinary Sectors (SSD). Each bar represents a specific SSD, with different colours indicating the percentage of abstracts associated with each topic.
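The percentages behind a stacked bar chart of this kind can be obtained by normalizing topic counts within each SSD. The following Python sketch illustrates the computation under the assumption that each abstract carries one SSD code and one topic label; the function name and the toy assignments are invented.

```python
from collections import Counter, defaultdict

def topic_shares(assignments):
    """assignments: iterable of (ssd, topic) pairs. Returns {ssd: {topic: percent}}."""
    counts = defaultdict(Counter)
    for ssd, topic in assignments:
        counts[ssd][topic] += 1
    shares = {}
    for ssd, topic_counts in counts.items():
        total = sum(topic_counts.values())
        shares[ssd] = {t: 100 * n / total for t, n in sorted(topic_counts.items())}
    return shares

# Toy (SSD, topic index) assignments, invented for illustration only.
toy_assignments = [("BIO/07", 0), ("BIO/07", 0), ("BIO/07", 4), ("ICAR/21", 6)]
shares = topic_shares(toy_assignments)
```

Because each row is normalized separately, an SSD with very few abstracts can show a single dominant topic purely by chance, which is worth keeping in mind when comparing bars.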
The agricultural sciences (AGR/02, AGR/07, AGR/09, AGR/11, AGR/12, AGR/14, AGR/18) exhibit a high diversity of topics, with topics 0, 3, and 5 (Ecology and Plant Sciences; Climate Policy and Regulations; and Energy, Performance, and Management) being the most represented. In biological sciences (BIO/01, BIO/03, BIO/05, BIO/07), Topic 0 is also dominant, reinforcing the focus on ecological and plant-related aspects of climate change. However, these disciplines also show a notable presence of Topic 1 (Agriculture and Food Production), Topic 2 (Microbiology and Biodiversity), and Topic 4 (Hydrology, Water Resources and Population Studies). Chemistry-related SSDs (CHIM/01, CHIM/05, CHIM/06) are primarily associated with Topic 5 (Energy, Performance and Management) and Topic 7 (Biodiversity and Cryosphere), suggesting a focus on energy systems optimization, material performance under climate stress, and the chemical processes related to biodiversity conservation and cryospheric dynamics. In geosciences (GEO/04, GEO/05, GEO/08), there is a strong representation of Topic 6 (Energy, Building Design, and Infrastructure), highlighting the discipline’s contribution to the assessment of geophysical factors in energy and construction planning. Architecture and urban planning disciplines (ICAR/01, ICAR/02, ICAR/08, ICAR/09, ICAR/20, ICAR/21) are primarily associated with Topic 5 (Energy, Performance and Management) and Topic 6 (Energy, Building Design, and Infrastructure), reinforcing their engagement in the analysis of energy consumption and the design of energy-efficient infrastructure. The presence of Topic 6 in these disciplines suggests an increasing interest in integrating energy efficiency principles into architectural and spatial planning. 
The engineering fields (ING-IND/08, ING-IND/09, ING-IND/11, ING-IND/17, ING-IND/22, ING-IND/24, ING-IND/25, ING-IND/26, ING-IND/27, ING-IND/33, ING-IND/34, ING-IND/35, ING-INF/05) display a strong representation of Topic 1 (Agriculture and Food Production), Topic 5 (Energy, Performance and Management) and Topic 7 (Biodiversity and Cryosphere), reflecting their multifaceted role in developing technological solutions for sustainable agriculture, energy systems, and environmental monitoring. IUS/04 (Business Law) and MED/42 (Hygiene and public health) show a thematic focus on Climate Policy and Regulations, addressing topics such as life cycle assessment and human rights from both legal and public health perspectives. Finally, economics and social sciences (SECS-P/06, SECS-P/07, SECS-P/08, SECS-S/06, SPS/01, SPS/04) show a stronger presence of Topic 1 (Agriculture and Food Production), Topic 3 (Climate Policy and Regulations), and Topic 6 (Energy, Building Design, and Infrastructure), suggesting an interest in the economic and social implications of environmental policies.
Figure 13 illustrates the distribution of topics for English dissertations across the macro-geographical areas of Italian universities.
Perplexity analysis for optimal topic selection in Italian dissertations using K-Means. The best choice for the number of topics appears to be 6, as going beyond this value results in rapidly increasing perplexity, which suggests a decline in model quality. At 6 topics, the slope of the perplexity curve starts to change, marking the transition point between coherence and unnecessary complexity.
In the Northwest, research is well-balanced across multiple themes, with a strong presence of studies on Ecology and Plant Sciences. This suggests a significant focus on plant growth, species interactions, and biodiversity conservation, likely influenced by the region’s natural landscapes and environmental research initiatives. The Northeast follows a similar trend but places greater emphasis on Agriculture and Food Production and Hydrology, Water Resources, and Population Studies. This is underscored by the fact that among the representative keywords for Topic 4 is “Northern Italy”, emphasizing the regional context. The strong presence of agricultural research reflects the importance of farming, livestock production, and food sustainability in this area, while the attention to hydrology and water resources indicates a focus on climate change impacts (e.g. drought), groundwater management, and flood risk assessment, particularly relevant for Northern Italy. Moving to the Centre of Italy, the research landscape is characterized by a diverse range of topics, with a notable focus on Energy, Performance, and Management. This suggests that studies in this region are largely centred on energy efficiency, resource consumption, and sustainable technological advancements, possibly linked to urban infrastructure and industrial applications. A similar trend is observed in the South, where Energy, Performance, and Management also emerges as a dominant topic, with an even greater proportion of research dedicated to this area. This indicates a strong regional interest in energy efficiency and long-term sustainability, likely driven by efforts to integrate renewable energy solutions and optimize resource use in response to climatic conditions. The Islands, in contrast, show a predominant focus on Climate Policy and Regulations and Hydrology, Water Resources, and Population Studies. 
This suggests that research in these regions is particularly concentrated on sustainability policies, environmental governance, and climate adaptation strategies, alongside studies on water management and ecosystem services, which are critical for island territories facing challenges such as water scarcity and environmental vulnerability.
Sentiment analysis
Figure 14 presents the results of Climate-BERT sentiment analysis on English-language dissertations, examining how different perspectives on climate change have evolved over time. The main graph shows the overall distribution of sentiments (categorized as “Challenges”, “Neutral” and “Solutions”) from 2008 to 2021, while the smaller inset graph focuses specifically on BIO/07 (Biology), which is the most frequently occurring Scientific Disciplinary Sector (SSD) in English-language dissertations discussing climate change. Dissertations categorized as “Challenges” tend to highlight the risks, negative impacts, or unresolved issues surrounding climate change. Those identified as “Solutions” present more optimistic or action-oriented perspectives, often focusing on proposed interventions, mitigation strategies, or innovations. Lastly, “Neutral” texts discuss climate change in a more descriptive or analytical way, without explicitly presenting it as a problem to solve or a solution to implement.
Perplexity analysis for optimal topic selection in English dissertations using K-Means. The optimal number of topics is 8, as this value maintains low perplexity while preventing excessive topic fragmentation. At 8 topics, the slope of the perplexity curve starts to change, marking the transition point between coherence and unnecessary complexity.
In the main panel of Fig. 14, the bars show the percentage distribution of the three sentiment categories over time. The yellow section represents “Solutions”, the green section denotes “Neutral” perspectives, and the purple section indicates “Challenges”. Neutral and solutions-oriented perspectives consistently hold the highest shares across the years, while the share of challenges remains relatively stable. The proportion of solutions-oriented sentiment is particularly high between 2008 and 2010, before slightly decreasing and settling at around 35% of the total dissertations in each subsequent year. This suggests that academic discourse has maintained a focus on actionable solutions, albeit with a slightly moderated emphasis over time. The inset chart, which isolates BIO/07 (Biology), shows a predominance of dissertations focusing on solutions to climate change. Neutral perspectives also have a significant share, while no dissertations in this sector are classified as “Challenges”. This pattern indicates that biological research is heavily oriented towards finding ways to address climate-related issues, such as biodiversity conservation, ecosystem restoration and adaptation strategies. The consistently high representation of solutions-oriented and neutral perspectives over time suggests that climate change research in academia has maintained a proactive and action-focused approach.
Discussion and conclusion
The purpose of this study was to investigate the trends regarding climate change research in the Italian academic system, as reflected in doctoral dissertations defended between 2008 and 2021. By analysing a large dataset of dissertation abstracts, we aimed to identify key research themes, disciplinary contributions and macro-regional orientation. Our findings reveal a diverse range of topics, reflecting the multifaceted nature of climate change. These include studies on the physical impacts of climate change (e.g., sea-level rise, extreme weather events), mitigation strategies (e.g., renewable energy, sustainable agriculture), and adaptation measures. However, our analysis suggests that climate change has not been a widely addressed topic in doctoral research during the period considered. Only about 13% of the analyzed dissertations explicitly focused on climate change, indicating that despite its global relevance, the topic remains relatively underrepresented in the Italian doctoral education system. While some disciplines, such as biology and engineering, have made significant contributions, other fields appear to be less engaged with climate-related research. This underrepresentation may stem from multiple factors, such as the lower prioritization of climate change within certain disciplinary agendas or structural issues like unequal funding across research fields. These dynamics point to a broader gap in academic engagement with climate challenges, which future studies could investigate more systematically. There is a strong representation of environmental and biodiversity-related research in agricultural and biological sciences. Industrial and engineering disciplines are more closely associated with energy efficiency, thermal process optimization and sustainable infrastructure development. 
Similarly, architecture and urban planning disciplines prioritize urban sustainability and spatial planning, reflecting their role in designing energy-efficient buildings and resilient urban environments. The presence of environmental assessment and pollution monitoring in chemistry and geosciences highlights the interdisciplinary nature of environmental risk analysis. While chemistry and geosciences contribute through technical assessments of soil, air and water quality, social sciences and economics integrate the policy and governance dimensions. A notable trend is the strong alignment between economics, law and environmental governance. The concentration of research on regulatory frameworks, sustainability policies, and population studies suggests that legal and economic disciplines are engaged in shaping climate policies and ensuring that environmental innovations translate into actionable regulations. However, we also observed a potential underrepresentation of the social sciences and humanities compared to the other scientific fields. The number of climate-related dissertations in the social sciences is remarkably low and primarily concentrated in the areas of Agriculture and Food Production, Urban Development, and Climate Policy and Regulations.
Our findings suggest that research interests often align with regional specificities. The North (Northwest and Northeast) places significant emphasis on ecological and agricultural research, particularly in biodiversity conservation, food production and water resource management. This aligns with the region’s strong agricultural and industrial base, where climate adaptation strategies for farming and ecosystem resilience are crucial. The Northeast’s attention to hydrology and flood risk assessment further reflects the increasing vulnerability of Northern Italy to extreme weather events, reinforcing the need for research on water resource sustainability and climate adaptation. Keywords associated with Topic 4 in the English corpus support this observation. In contrast, the Centre and South of Italy exhibit a clear focus on energy-related topics, particularly in Energy, Performance and Management. The strong presence of research in this area suggests that these regions are more engaged in developing energy-efficient solutions, integrating renewable energy, and optimizing resource consumption, likely driven by climatic conditions, urban infrastructure demands and energy transition policies. The Islands present a unique research profile, dominated by Climate Policy and Regulations and Hydrology, Water Resources and Population Studies. The strong emphasis on policy and governance suggests a focus on regulatory frameworks and decision-making processes for managing climate risks, rather than purely technical or engineering solutions.
These findings reinforce the idea that climate change research is not uniform across Italy but instead reflects localized priorities and challenges. While the North engages in land, water and biodiversity conservation, the Centre and South prioritize technological and energy solutions, and the Islands focus on governance and climate adaptation policies. This regional specialization could provide opportunities for cross-regional collaboration, where different areas contribute expertise in complementary aspects of climate mitigation and adaptation, fostering more integrated and effective sustainability strategies across the country. Future research should include a more detailed content analysis of doctoral dissertations to better understand this aspect and to detect the extent to which regional differences in the distribution of topics might be related to local conditions.
One limitation of our research is that we maintained the original language of the abstracts, which caused an underrepresentation of Italian-language dissertations, with implications for the estimation of topics. Yet, the topics that emerged showed strong coherence, representing the ongoing discourses in the Italian academic landscape.
These results carry several important implications. First, they underscore the growing awareness of climate change among early-career researchers in Italy and their commitment to addressing this global challenge. Second, they highlight the importance of interdisciplinary collaboration to tackle the complex and multifaceted nature of climate change. Finally, they suggest the need for tailored research and policy solutions that address the specific challenges faced by different regions of Italy.
This study provides a foundation for future research. Further investigation into the under-representation of certain disciplines, such as the social sciences and humanities, as well as observed regional differences, could help foster a more comprehensive and socially informed approach to climate change research and policy. Future research could also look at what kinds of methods are used in climate-related dissertations. This could help build a broader understanding of how local contexts influence the way climate change is explored and how research is tailored to address it.
Materials and methods
Dataset
The dataset analyzed in this study comprises doctoral dissertations from the National Central Library of Florence, which documents Italian PhD dissertations from 1985 to the present in UNIMARC (Universal Machine Readable Cataloguing) format. UNIMARC is a widely adopted bibliographic standard that facilitates the accurate cataloguing and registration of documents in bibliographic databases. It organizes metadata into specific fields and subfields, including year, language, place of publication and institutional affiliations (International Federation of Library Associations and Institutions (IFLA), “UNIMARC Permanent Committee” [Online]. Available: https://www.ifla.org/units/unimarc-rg/. [Accessed: 14 February 2025]).
For this research, a subset covering the years 2008–2021 was selected, consisting of 128,955 records. Since some dissertations did not include an abstract or lacked complete information, our final sample comprises 74,394 records. Considering that approximately 8,000 PhD graduates complete their studies in Italy each year, the total number of doctoral dissertations produced over time is remarkably high.
To ensure dataset consistency and relevance, non-dissertation records were removed before proceeding with further cleaning steps. These records were identified as they did not conform to the expected structure of doctoral dissertations. An initial structural analysis was conducted to identify key fields and subfields relevant to this study. This analysis revealed inconsistencies such as university names, institutional affiliations and departmental affiliations being scattered across multiple fields, duplicated or recorded in inconsistent formats. In some cases, data were missing from expected fields but appeared elsewhere, likely due to variations in data entry practices over time.
To address these issues, we developed a data-cleaning pipeline focused on standardization and structure refinement. The first step resolved HTML encoding errors to ensure accurate text representation. We then extracted and harmonized key elements, particularly SSD classification codes and university affiliations, using pattern-based extraction techniques.
Given the dataset’s complexity, we employed regular expressions (regex) to systematically retrieve relevant UNIMARC fields, including year, language, title, author, contributors, Italian PhD cycle, SSD classification, abstract and institutional details. Regex proved effective in isolating information but required further refinement to correct formatting anomalies.
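A minimal Python sketch of this kind of pattern-based extraction is given below. The tags follow common UNIMARC usage (101 for language, 200 for title and statement of responsibility, 330 for the summary), but the line-based serialization, the sample record, and the function name are assumptions made for illustration; the actual export format and patterns used in the study are not specified here.

```python
import re

# Hypothetical plain-text serialization: one field per line, "TAG $a...$f...".
# The record content is invented for illustration only.
SAMPLE_RECORD = """\
101 $aeng
200 $aUrban adaptation to climate change$fMaria Rossi
330 $aThis dissertation analyses local adaptation policies.
"""

FIELD_RE = re.compile(r"^(?P<tag>\d{3})\s+(?P<body>\$.*)$", re.MULTILINE)
SUBFIELD_RE = re.compile(r"\$(?P<code>[a-z0-9])(?P<value>[^$]*)")

def parse_record(text):
    """Return {tag: {subfield_code: value}} for a single record."""
    fields = {}
    for field in FIELD_RE.finditer(text):
        subfields = {
            sub["code"]: sub["value"].strip()
            for sub in SUBFIELD_RE.finditer(field["body"])
        }
        # Repeated tags are merged; later subfields overwrite earlier ones.
        fields.setdefault(field["tag"], {}).update(subfields)
    return fields

record = parse_record(SAMPLE_RECORD)
language = record["101"]["a"]
title = record["200"]["a"]
author = record["200"]["f"]
abstract = record["330"]["a"]
```

Merging repeated tags is a simplification; where a field legitimately repeats (for example, several contributors), a list per tag would be the safer structure.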
For author and contributor names, normalization addressed formatting inconsistencies such as misplaced commas, extra spaces, and special characters. We lowercased the terms, removed extraneous symbols, and ensured uniform spacing. Contributor names underwent similar processing to maintain a uniform structure.
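The normalization steps listed above can be illustrated as follows. This is a sketch of one plausible implementation, not the study's code: the function name and the exact set of characters kept (letters, digits, commas, apostrophes, hyphens) are assumptions.

```python
import re
import unicodedata

def normalize_name(raw):
    """Lowercase a name, drop stray symbols, tidy commas and collapse whitespace."""
    text = unicodedata.normalize("NFC", raw).lower()
    text = re.sub(r"[^\w\s,'\-]", " ", text)      # keep letters, digits, comma, apostrophe, hyphen
    text = re.sub(r"(?:\s*,\s*)+", ", ", text)    # one comma followed by one space
    text = re.sub(r"\s+", " ", text)              # collapse runs of whitespace
    return text.strip(" ,")

cleaned = normalize_name("  ROSSI ,Maria*  ")
```

Applied to a raw entry with a misplaced comma and a stray symbol, the function returns a single lowercase form, so that spelling variants of the same person collapse to one key.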
SSD classification codes and descriptions often exhibited textual discrepancies, typographical errors, and formatting inconsistencies. To resolve these issues, we applied a mapping-based standardization process, aligning the extracted SSD codes with the official classification system defined by the Italian Ministry of University and Research (MIUR), according to the classification conventions valid at the time of each dissertation (Università di Padova, “Corrispondenza tra vecchi e nuovi settori scientifico-disciplinari (SSD)”. [Online]. Available: https://www.unipd.it/sites/unipd.it/files/allegato_C.pdf [Accessed: 14 February 2025]).
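A mapping-based standardization of this kind might look like the Python sketch below. The set of official codes and the alias table are tiny hypothetical excerpts written for illustration; in practice they would be built from the ministerial classification, and the function name is likewise invented.

```python
import re

# Hypothetical excerpts; a real table would cover the full official SSD list.
OFFICIAL_CODES = {"BIO/07", "ING-IND/11", "SECS-P/07", "ICAR/21"}
ALIASES = {"SECS-P-07": "SECS-P/07", "INGIND/11": "ING-IND/11"}

def standardize_ssd(raw):
    """Return the official SSD code for a raw string, or None if it cannot be resolved."""
    code = re.sub(r"\s+", "", raw.upper())      # "bio / 07" -> "BIO/07"
    code = re.sub(r"/(\d)$", r"/0\1", code)      # "BIO/7"    -> "BIO/07"
    code = ALIASES.get(code, code)               # known misspellings -> official form
    return code if code in OFFICIAL_CODES else None

resolved = standardize_ssd("bio / 7")
```

Returning `None` for unresolved strings, rather than guessing, keeps ambiguous records visible for manual review.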
University names posed additional challenges due to their distribution across multiple fields. Since different fields could contain overlapping or incomplete information, we established a precedence strategy based on UNIMARC’s cataloguing rules to ensure that the most complete and accurate institutional information was retained. After extraction, we further refined university names using mapping files to reconcile spelling variations, abbreviations, and inconsistent naming conventions. This normalization step was essential to unify institutional references—particularly for Italian universities—and to remove duplicates that could compromise dataset integrity. After applying the procedure described above and excluding dissertations not in English or Italian (a negligible number), we obtained a total of 13,132 dissertations in English and 5,771 in Italian with all records fully observed.
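The precedence strategy and the subsequent name reconciliation can be sketched as a simple "first non-empty field wins" rule followed by a lookup. The field names, their order, and the mapping entries below are hypothetical placeholders chosen for illustration; the actual UNIMARC fields and mapping files used in the study are not reproduced here.

```python
# Hypothetical field names, listed from most to least trusted.
PRECEDENCE = ["degree_granting_body", "corporate_author", "publisher"]

# Hypothetical excerpt of a variant -> canonical-name mapping file.
CANONICAL_NAMES = {
    "univ. di bologna": "Università di Bologna",
    "alma mater studiorum": "Università di Bologna",
}

def resolve_university(record):
    """Pick the first non-empty field by precedence, then map it to a canonical name."""
    for field in PRECEDENCE:
        value = (record.get(field) or "").strip()
        if value:
            return CANONICAL_NAMES.get(value.lower(), value)
    return None

university = resolve_university(
    {"degree_granting_body": " ", "publisher": "Univ. di Bologna"}
)
```

Names that are not in the mapping are returned unchanged in this sketch, which makes unmapped variants easy to list and add to the mapping file.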
These systematic cleaning procedures enhanced data consistency and reliability, facilitating a more precise examination of Italian doctoral research trends.
Machine learning classification process: identify climate change related research
The initial step of our analysis involved developing a machine learning (ML)-based classification tool to systematically identify doctoral dissertations related to climate change. To achieve this, we employed text classification, a process that automatically assigns documents to predefined categories based on their content29. This task was accomplished through a supervised learning approach, where a decision function is trained using manually labelled data (350 for both Italian and English abstracts), evaluated on a test set, and then applied to classify dissertations whose category is unknown. To create the training and test sets, we randomly selected a subset of dissertations from the initial dataset.
Each dissertation was classified as climate change-related or not by two independent subject-matter experts based on titles and abstracts of the dissertations, available in both Italian and English. In cases of ambiguity or disagreement, classifications were discussed and resolved collaboratively.
This expert-labelled dataset was then split into training (70%) and test (30%) sets, following standard ML best practices30. The training set was used to develop the classification model, while the test set was used to evaluate its predictive performance.
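A 70/30 split of a labelled sample can be sketched as follows. Stratifying by label, the fixed random seed, and the toy data are assumptions added for illustration; the text above specifies only the proportions.

```python
import random
from collections import defaultdict

def stratified_split(items, labels, train_frac=0.7, seed=42):
    """Split (item, label) pairs into train/test, preserving label proportions."""
    by_label = defaultdict(list)
    for item, label in zip(items, labels):
        by_label[label].append((item, label))
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        cut = round(train_frac * len(group))
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test

# Toy labelled sample: 350 placeholder abstracts, one in five marked as climate-related.
toy_docs = [f"abstract {i}" for i in range(350)]
toy_labels = [1 if i % 5 == 0 else 0 for i in range(350)]
train_set, test_set = stratified_split(toy_docs, toy_labels)
```

Stratification matters when the positive class is rare, since a purely random split could leave very few climate-related examples in the test set.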
Once the labelled dataset was established, feature extraction was performed using text mining techniques31. This process transformed textual data into numerical vectors, enabling statistical analysis. The following text pre-processing steps were applied to ensure data consistency and optimize model performance for Italian and English dissertations:
- Text normalization (conversion to lowercase, spell checking, contraction expansion).
- Removal of numbers, punctuation, and special characters (e.g., @, °, #, §, +).
- Stopword removal, excluding common words with little discriminative value.
- Lemmatization, reducing words to their base form to standardize vocabulary.
- N-gram extraction, specifically bigrams (2-grams) and trigrams (3-grams), to capture more complex word sequences and contextual relationships.
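The pre-processing steps listed above could be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: the stopword set and lemma table are tiny hypothetical stand-ins for full language-specific resources, and spell checking and contraction expansion are omitted.

```python
import re

# Hypothetical miniature resources; a real pipeline would use complete
# Italian and English stopword lists and a proper lemmatizer.
STOPWORDS = {"the", "of", "on", "and", "in", "a", "to", "is", "are"}
LEMMAS = {"emissions": "emission", "changes": "change", "models": "model"}

def preprocess(text, max_n=3):
    """Return the unigrams, bigrams and trigrams of a raw abstract."""
    # Lowercase, then keep runs of word characters that are not digits or
    # underscores; this drops numbers, punctuation and special symbols.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    # Remove stopwords, then map each remaining token to its base form.
    tokens = [LEMMAS.get(t, t) for t in tokens if t not in STOPWORDS]
    ngrams = []
    for n in range(1, max_n + 1):
        ngrams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return ngrams

features = preprocess("Emissions and the climate changes in 2021: models!")
print(features)
```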
After pre-processing, a TF-IDF (Term Frequency-Inverse Document Frequency) matrix was created, representing each dissertation as a vector of weighted term frequencies. We constructed 1-gram, 2-gram and 3-gram representations, retaining only n-grams that appeared in at least 2 documents in order to reduce sparsity. This process resulted in a set of predictive features used to train the classification model. The classification task aimed to map textual features to a binary classification label, where 1 = climate change-related dissertation and 0 = non-climate change dissertation. Several machine learning models were trained and evaluated to determine the most accurate classifier. The algorithms tested included:
- Naïve Bayes (NB): Assumes feature independence within each class32,33.
- Linear Support Vector Classification (LinearSVC): Fits a maximum-margin linear decision boundary between the two classes34,35.
- Logistic Regression (LR): Estimates class probabilities using the logistic function and is widely used for binary classification36.
- Random Forest (RF): An ensemble of decision trees, each trained on a bootstrap sample with random feature subsets37,38.
- Gradient Boosting Machine (GBM): Builds decision trees sequentially, each one correcting the errors of its predecessors39.
- Neural Networks (NN): Uses interconnected layers to learn complex patterns in text40,41.
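The TF-IDF representation that feeds these classifiers can be illustrated with a short sketch. The weighting used here (relative term frequency times ln(N/df) + 1) is one common variant and is an assumption, since the text does not specify the exact formula; the `min_df=2` filter mirrors the rule of keeping n-grams found in at least 2 documents.

```python
import math
from collections import Counter

def tfidf(docs, min_df=2):
    """docs: list of token lists. Returns (vocabulary, list of sparse dict vectors)."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(t for t, c in df.items() if c >= min_df)
    idf = {t: math.log(n_docs / df[t]) + 1.0 for t in vocab}
    vectors = []
    for doc in docs:
        # Count only retained terms; frequencies are relative to those terms.
        counts = Counter(t for t in doc if t in idf)
        total = sum(counts.values()) or 1
        vectors.append({t: (c / total) * idf[t] for t, c in counts.items()})
    return vocab, vectors

docs = [["climate", "change", "model"], ["climate", "policy"], ["soil", "model"]]
vocab, vectors = tfidf(docs)
print(vocab)
```

Terms seen in a single document ("change", "policy", "soil" in this toy corpus) are dropped, which is how the frequency threshold reduces sparsity.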
Since climate change-related dissertations were significantly fewer than non-related ones, we addressed class imbalance using two strategies:
- Class weighting: Assigning higher misclassification penalties to the minority class.
- Random Over-Sampling Examples (ROSE): Generating synthetic samples to balance the dataset42.
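To make the class-weighting idea concrete, the sketch below computes weights inversely proportional to class frequency, w_c = n / (k * n_c), the heuristic behind scikit-learn's "balanced" option. Whether the original analysis used this exact formula is an assumption, and the label counts are invented for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n / (k * n_c): rarer classes get larger penalties."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * n_c) for c, n_c in counts.items()}

# Invented example: 20 climate-related abstracts (1) among 200 labelled ones.
labels = [1] * 20 + [0] * 180
weights = balanced_class_weights(labels)
print(weights)
```

Here a misclassified climate-related abstract costs nine times as much as a misclassified unrelated one, which offsets the 1:9 class ratio.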
To evaluate the models, we applied 10-fold cross-validation, tuning hyperparameters to maximize the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve38,40. Among the tested models, Random Forest combined with class weighting performed best for English dissertations (accuracy of 0.95), while LinearSVC performed best for Italian dissertations (accuracy of 0.97).
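The two ingredients of this evaluation, fold construction and the AUC criterion, can be sketched without any external library. Both helper names are hypothetical; the AUC is computed through its rank interpretation (the probability that a randomly chosen positive scores above a randomly chosen negative), and the fold builder assumes the data have already been shuffled.

```python
def roc_auc(y_true, scores):
    """AUC as the share of positive-negative pairs ranked correctly (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, val_idx) pairs; fold sizes differ by at most one."""
    indices = list(range(n_samples))
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]

auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
splits = list(k_fold_indices(245))
print(auc, len(splits))
```

In a tuning loop, each candidate hyperparameter setting would be fitted on every training part, scored with `roc_auc` on the matching validation part, and ranked by the mean AUC over the ten folds.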
Topic modelling selection
After identifying the full set of climate change-related dissertations, we conducted topic modelling and sentiment analysis to explore the key themes and narratives emerging in the analyzed corpus. Specifically, these analyses aimed to:
- Identify dominant research topics within climate change-related dissertations.
- Assess whether dissertations focus more on climate challenges or potential solutions.
- Evaluate disciplinary and regional variations in climate research themes.
To identify distinct research topics within climate change-related doctoral dissertations, we applied two approaches, BERTopic modelling and K-means clustering, the latter being an unsupervised machine learning algorithm that partitions data into clusters based on similarity measures43. This combination was chosen not only to understand the different topics emerging in climate change research but also to assess the degree of overlap among them. Climate research is interdisciplinary, and topics often share conceptual and methodological similarities. By combining BERT-based embeddings with unsupervised clustering, we aimed to capture these connections, identifying both distinct research themes and areas of thematic convergence. This approach provides a multi-layered perspective on the structure of climate research, going beyond simple categorization. BERT leverages deep contextual embeddings to detect hidden semantic relationships, while K-means clustering offers an accessible and interpretable way to refine topic boundaries.
The BERT-based embeddings were obtained with ClimateBERT44, a fine-tuned version of DistilRoBERTa45 specifically designed for climate-related text. BERTopic was then used to generate semantically coherent clusters, enhancing interpretability and thematic consistency46. BERTopic combines semantic embeddings, dimensionality reduction (e.g., UMAP), and density-based clustering (e.g., HDBSCAN) to infer the number of topics automatically, without needing to set it a priori.
Meanwhile, K-means clustering was applied as a complementary method to categorize topics efficiently. One of the key challenges in clustering is determining the optimal number of clusters (k). To achieve this, we followed a two-step approach:
- Elbow Method: We ran the K-means algorithm for values of k ranging from 1 to 100, plotting the Sum of Squared Errors (SSE) against k. We also conducted a perplexity analysis to determine the optimal number of clusters for the K-means algorithm in topic extraction. The optimal number of clusters was identified at k = 6 for Italian and k = 8 for English dissertations (see Figs. 15 and 16), where the SSE curve showed an “elbow”, indicating diminishing returns in cluster separation beyond that point.
- Refinement through Keyword Analysis and Domain Knowledge: After extracting keywords (mainly unigrams and bigrams) from each cluster, we manually reviewed them to assess semantic similarity and coherence. This refinement process led to the merging of overlapping clusters, reducing the final number of distinct themes.
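The elbow step can be illustrated with a small, self-contained K-means (Lloyd's algorithm) that records the SSE for each k. Everything here is a toy assumption: the 2-D points stand in for document vectors, k runs from 1 to 6 rather than 1 to 100, and a single random initialization is used per k.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def kmeans_sse(points, k, iters=50, seed=0):
    """Run Lloyd's algorithm once and return the sum of squared errors."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise on k distinct data points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        updated = [mean(cl) if cl else centroids[i] for i, cl in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

# Toy data: three 2-D blobs (illustrative only).
rng = random.Random(1)
centres = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
          for cx, cy in centres for _ in range(20)]

sse_curve = {k: kmeans_sse(points, k) for k in range(1, 7)}
for k, sse in sse_curve.items():
    print(k, round(sse, 2))
```

Plotting `sse_curve` against k and looking for the point where the decline flattens gives the elbow; in practice several initializations per k make the curve more stable.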
For each cluster, we built a binary classification model using Random Forest. Abstracts in the target cluster were labelled as “1” (positive), while all others were labelled as “0” (negative). The classifier was trained to distinguish between these two groups, and the resulting feature importance scores revealed which keywords were most effective at identifying abstracts in that cluster, thereby highlighting its key themes. This classification process allowed us to determine the most important keywords within each cluster, as well as the relative contribution of specific terms to the thematic identity of each group. This procedure ensured that clusters were well-defined and interpretable, enabling a comprehensive thematic classification of climate change-related dissertations. In K-means, each abstract is assigned to a single cluster based on its closest centroid, meaning that every abstract belongs exclusively to one cluster. This is different from topic modelling approaches, where documents can have a probability distribution over multiple topics.
Finally, we used sentiment analysis with ClimateBERT, a pre-trained language model that specializes in climate-related texts, to evaluate how dissertations about climate change frame their discourse, focusing on issues, neutral viewpoints or solutions44. Given that ClimateBERT for sentiment analysis has been trained on English-language texts, this sentiment analysis was conducted exclusively on dissertations written in English (1,178). Abstracts were divided into three sentiment classes by the model:
- challenges (such as dangers, weaknesses and adverse effects of climate change).
- neutral (e.g., theoretical discussions or descriptive investigations).
- solutions (such as policy suggestions, mitigation techniques, and technology developments).
To identify patterns in climate change research, we mapped sentiment distributions across Italian academic disciplines, using SSD (Scientific Disciplinary Sector) categories as a reference framework. This enabled us to investigate whether specific scientific disciplines emphasize particular storylines (for example, engineering dissertations focusing on solutions and environmental sciences emphasizing risks).
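The mapping of sentiment classes onto disciplines amounts to a cross-tabulation, which might look like the sketch below. The records and discipline labels are invented placeholders rather than study data, and `sentiment_shares` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

def sentiment_shares(records):
    """Return, for each discipline, the share of abstracts in each sentiment class."""
    counts = defaultdict(Counter)
    for discipline, label in records:
        counts[discipline][label] += 1
    return {
        d: {label: c / sum(cnt.values()) for label, c in cnt.items()}
        for d, cnt in counts.items()
    }

# Invented (discipline, predicted sentiment class) pairs.
records = [
    ("engineering", "solutions"),
    ("engineering", "solutions"),
    ("engineering", "neutral"),
    ("environmental sciences", "challenges"),
    ("environmental sciences", "neutral"),
]
shares = sentiment_shares(records)
print(shares["engineering"])
```

Working with within-discipline shares rather than raw counts keeps disciplines with many dissertations from dominating the comparison.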
Data availability
All data used in this study are publicly available from open-access archives. The code and labelled datasets used for the machine learning analyses are available upon reasonable request by contacting antonio.zinilli@cnr.it.
References
IPCC, 2021: Climate Change 2021: The Physical Science Basis. Contribution of Working Group I to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change (eds Masson-Delmotte, V., Zhai, P., et al.) (Cambridge University Press, 2021). https://doi.org/10.1017/9781009157896
UNFCCC. 2016. Paris agreement. United Nations Treaty Collection. United Nations (2020) Climate change. (accessed 3 January 2025). https://unfccc.int/sites/default/files/resource/UNFCCC_Annual_Report_2020.pdf (2016).
Spano, D. et al. Analisi del rischio. I cambiamenti climatici in sei città italiane. Centro Euro-Mediterraneo sui cambiamenti climatici, Lecce. (2021).
Ghinoi, S. & Steiner, B. The political debate on climate change in Italy: A discourse network analysis. Politics Gov. 8 (2), 215–228 (2020).
Biancalana, C., Ladini, R. & Visconti, F. Climate change in Italy: towards the politicization of an issue. Italian Political Sci. 18 (3), 177–193 (2023).
Liu, B. Sentiment Analysis and Opinion Mining (Springer Nature, 2022).
de Gouveia, M. & Inglesi-Lotz, R. Examining the relationship between climate change-related research output and CO2 emissions. Scientometrics 126, 9069–9111. https://doi.org/10.1007/s11192-021-04148-x (2021).
Felton, A. et al. Climate change, conservation and management: an assessment of the peer-reviewed scientific journal literature. Biodivers. Conserv. 18, 2243–2253 (2009).
Liu, F. Retrieval strategy and possible explanations for the abnormal growth of research publications: re-evaluating a bibliometric analysis of climate change. Scientometrics 128, 853–859. https://doi.org/10.1007/s11192-022-04540-1 (2023).
Hiltner, S. Limited attention to climate change in US sociology. The Am. Sociol. 1–25 (2024).
Schipper, E. L. F., Dubash, N. K. & Mulugetta, Y. Climate change research and the search for solutions: rethinking interdisciplinarity. Clim. Change. 168 (3), 18 (2021).
Wang, B., Pan, S. Y., Ke, R. Y., Wang, K. & Wei, Y. M. An overview of climate change vulnerability: a bibliometric analysis based on web of science database. Nat. Hazards. 74, 1649–1666 (2014).
Jankó, F., Papp Vancsó, J. & Móricz, N. Is climate change controversy good for science? IPCC and contrarian reports in the light of bibliometrics. Scientometrics 112 (3), 1745–1759 (2017).
Wang, Z., Zhao, Y. & Wang, B. A bibliometric analysis of climate change adaptation based on massive research literature data. J. Clean. Prod. 199, 1072–1082 (2018).
Schwechheimer, H. & Winterhager, M. Highly dynamic specialities in climate research. Scientometrics 44, 547–560 (1999).
Li, J., Wang, M. H. & Ho, Y. S. Trends in research on global climate change: A science citation index Expanded-based analysis. Glob. Planet Change. 77 (1–2), 13–20 (2011).
Haunschild, R., Bornmann, L. & Marx, W. Climate change research in view of bibliometrics. PLoS One 11 (7), e0160393 (2016).
Sangam, S. L. & Savitha, K. S. Climate change and global warming: A scientometric study. COLLNET J. Scientometrics Inform. Manage. 13 (1), 199–212 (2019).
Fu, H. Z. & Waltman, L. A large-scale bibliometric analysis of global climate change research between 2001 and 2018. Clim. Change. 170 (3), 36 (2022).
Reale, E. & Spinello, A. O. Government R&D funding policy for academic research in Italy: are there incentives for climate change solutions? In Higher Education Policy for Tackling Climate Change: Drivers, Dynamics, and Effects 89–117 (Springer Nature Switzerland, 2025).
Ziman, J. Real Science: What It Is, and What It Means (Cambridge University Press, 2000).
Freeman, C. The ‘national system of innovation’ in historical perspective. Camb. J. Econ. 19 (1), 5–24 (1995).
Zinilli, A., Pierucci, E. & Reale, E. Organizational factors affecting higher education collaboration networks: evidence from Europe. High. Educ. 88 (1), 119–160 (2024).
Agrawala, S. Context and early origins of the intergovernmental panel on climate change. Clim. Change. 39, 605–620 (1998).
Victor, D. Climate change: embed the social sciences in climate policy. Nature 520, 27–29 (2015).
Franceschini, S., Faria, L. G. & Jurowetzki, R. Unveiling scientific communities about sustainability and innovation. A bibliometric journey around sustainable terms. J. Clean. Prod. 127, 72–83 (2016).
Pasgaard, M., Dalsgaard, B., Maruyama, P. K., Sandel, B. & Strange, N. Geographical imbalances and divides in the scientific production of climate change knowledge. Glob. Environ. Change. 35, 279–288 (2015).
Shi, H., Livescu, K. & Gimpel, K. Substructure substitution: Structured data augmentation for NLP. arXiv preprint arXiv:2101.00411. (2021).
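Yang, Y. & Liu, X. A re-examination of text categorization methods. In Proc. 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 42–49 (1999).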
Hastie, T., Tibshirani, R. & Friedman, J. H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction Vol. 2, 1–758 (Springer, 2009).
Feinerer, I., Hornik, K. & Meyer, D. Text mining infrastructure in R. J. Stat. Softw. 25 (5). https://doi.org/10.18637/jss.v025.i05 (2008).
James, G., Witten, D., Hastie, T. & Tibshirani, R. An Introduction to Statistical Learning: with Applications in R (Springer, 2013).
Nuzzolese, A. G. et al. Do altmetrics work for assessing research quality? Scientometrics 118 (2), 539–562 (2019).
Bologna, F., Di Iorio, A., Peroni, S. & Poggi, F. Do open citations give insights on the qualitative peer-review evaluation in research assessments? An analysis of the Italian National scientific qualification. Scientometrics 128 (1), 19–53 (2023).
Poggi, F. et al. Predicting the results of evaluation procedures of academics. PeerJ Comput. Sci. 5, e199 (2019).
Hosmer, D. W. Jr, Lemeshow, S. & Sturdivant, R. X. Applied Logistic Regression (Wiley, 2013).
Breiman, L. Random forests. Mach. Learn. 45 (1), 5–32. https://doi.org/10.1023/A:1010933404324 (2001).
Zinilli, A. & Cerulli, G. Link prediction and feature relevance in knowledge networks: A machine learning approach. PLoS One 18 (11), e0290018 (2023).
Friedman, J. H. Greedy function approximation: a gradient boosting machine. Ann. Stat. 29 (5), 1189–1232 (2001).
Resce, G., Zinilli, A. & Cerulli, G. Machine learning prediction of academic collaboration networks. Sci. Rep. 12 (1), 21993 (2022).
Ripley, B., Venables, W. & Ripley, M. B. Package ‘nnet’. R Package Version. 7 (3–12), 700 (2016).
Lunardon, N., Menardi, G. & Torelli, N. ROSE: a package for binary imbalanced learning. R J. 6 (1) (2014).
Kaushal, A., Acharjee, A. & Mandal, A. Machine learning based attribution mapping of climate related discussions on social media. Sci. Rep. 12 (1), 19033 (2022).
Bingler, J. A., Kraus, M., Leippold, M. & Webersinke, N. How cheap talk in climate disclosures relates to climate initiatives, corporate emissions, and reputation risk. J. Banking Finance. 164, 107191 (2024).
Sanh, V., Debut, L., Chaumond, J. & Wolf, T. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. ArXiv Preprint (2019). arXiv:1910.01108.
Grootendorst, M. BERTopic: neural topic modeling with a class-based TF-IDF procedure. ArXiv Preprint arXiv:2203.05794 (2022).
Acknowledgements
This work was supported by the FOSSR - Fostering Open Science in Social Science Research (CUP B83C22003950001) project, funded by the EU - NextGenerationEU under NRRP Grant agreement n. MUR IR0000008.
Funding
This research received funding from the EU - NextGenerationEU under the NRRP Grant agreement n. MUR IR0000008. The funding body had no role in the design of the study, data collection, analysis, interpretation, or manuscript preparation.
Author information
Contributions
A.Z. conceived the study, performed the statistical analyses, and supervised the overall research activities. G.T. and F.P. curated and preprocessed the data. A.Z., G.G.T., F.P., A.G.N., L.G., R.P., M.M., C.F.L., M.C., and S.Z. contributed to the methodological design, the implementation of semantic tools, and the integration of AI-based components. A.Z. also contributed to the literature review and writing of the manuscript. All authors reviewed, discussed, and approved the final version of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Zinilli, A., Tuccari, G.G., Poggi, F. et al. Anatomy of climate change research in Italian doctoral dissertations using a machine learning approach. Sci Rep 15, 38095 (2025). https://doi.org/10.1038/s41598-025-17307-4
DOI: https://doi.org/10.1038/s41598-025-17307-4















