Introduction

Neutrons, by virtue of their electrical neutrality and the nuclear character of their interaction with matter, provide a non-invasive and non-destructive means to probe the structure and dynamics of materials from atomic to mesoscopic or even macroscopic length scales. Thanks to their magnetic dipole moment, they can also probe magnetism. Neutrons can be employed in extreme and versatile environments, and are generally considered a unique and indispensable tool to study all forms of condensed matter. Instrumentation for neutron research is mostly available at large-scale facilities, where each neutron source serves several beamlines with instrumentation dedicated to specific scientific fields. Inherently, such facilities serve a broad range of scientific applications, from imaging at centimeter to micrometer scale, down to macromolecular and sub-atomic distances, and femtosecond to microsecond dynamics. The use of neutrons as scientific probes has evolved from solid state physics and chemical crystallography to a much broader range of topics, including soft matter, nanotechnology, biology, cultural heritage and engineering1,2. As a consequence of the breadth of topics covered, the scientific and societal impact of the community of ‘neutron scientists’ is significant but difficult to quantify and visualize.

To address this challenge, we analyze the scientific output of the European neutron science community using Natural Language Processing (NLP) and machine learning, to render an overview of the broad European community of scientists that use neutron research infrastructures. This study focuses primarily on the specific results for neutron scattering, while also demonstrating the novelty and applicability of our machine learning approach. With this approach we are able to show the evolution and distribution of the community and the main foci of its research in a quantitative manner. The outcome is shown to be consistent with the findings reported by others3,4,5.

The approach of using open-source software to analyze a scientific community through its publications originated from our participation, as the European Neutron Scattering Association (ENSA1), in the Horizon2020 Brightness2 project6. The development of the analysis tools served to describe the needs of the community for the long-term sustainability of the European Spallation Source (ESS7). We extended these tools to use semi-supervised machine learning to quantitatively describe the community and their scientific focus. While we are aware that there are other approaches to this type of analysis, we argue that our method allows us to render an overview of a community and its impact in an unbiased and quantitative manner, even when the community is rather dispersed, broad and heterogeneous.

Method

Data collection

The metadata of the publications were obtained from the Scopus database using the Scopus Search API, facilitated by the pybliometrics8 Python package. To compile a comprehensive database of publications related to neutron research, a straightforward term search was conducted on titles, keywords, and abstracts using the term “neutron” (“TITLE-ABS-KEY(neutron)”). To validate the relevance of the retrieved entries, two additional databases of publication metadata were collected, containing the entries associated with the ILL affiliation ID (60007109) and the ISIS Neutron and Muon Source affiliation ID (60001724). On the basis of the collected metadata, we created a Venn diagram to validate our search approach.
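
The validation step described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the `AF-ID(...)` query form, the helper names and the use of EIDs as publication identifiers are assumptions, and `fetch_eids` needs a configured Scopus API key (recent pybliometrics versions also require an explicit initialisation call first).

```python
# Query strings; the affiliation IDs are the ones quoted in the text.
# The AF-ID(...) form is an assumption about how the affiliation search was phrased.
NEUTRON_QUERY = "TITLE-ABS-KEY(neutron)"
ILL_QUERY = "AF-ID(60007109)"
ISIS_QUERY = "AF-ID(60001724)"


def fetch_eids(query):
    """Return the set of Scopus EIDs matching a query (requires an API key)."""
    # Imported lazily so the overlap helpers below run without pybliometrics.
    from pybliometrics.scopus import ScopusSearch
    search = ScopusSearch(query)
    return {doc.eid for doc in (search.results or [])}


def venn_regions(neutron, ill, isis):
    """Count facility publications inside and outside the 'neutron' corpus."""
    facility = ill | isis
    return {
        "facility_total": len(facility),
        "captured": len(facility & neutron),
        "missed": len(facility - neutron),
        "both_facilities": len(ill & isis),
    }


def captured_fraction(neutron, ill, isis):
    """Fraction of facility publications that the term search also finds."""
    regions = venn_regions(neutron, ill, isis)
    return regions["captured"] / regions["facility_total"]
```

Working with sets of identifiers keeps the overlap counts independent of the order in which the three queries were run.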

Data filtering – supervised machine learning

To filter out publications containing the term “neutron” that were deemed irrelevant to the neutron scattering community (e.g., references to ‘neutron stars’), we employed a supervised machine learning approach. This process aimed to classify publications into two categories: “relevant” and “irrelevant” to the neutron scattering community.

For the development of a classification model, we enlisted neutron scientists from the European Neutron Scattering Association (ENSA) to manually label a dataset of 13,139 publications from the years 2021 and 2022 as either “from the community (relevant)” or “not from the community (irrelevant).” These community representatives were provided with the metadata of each publication in the dataset. The representatives also could inspect the full text of the publication via its DOI, though the machine learning algorithm itself only utilized metadata.

The labeled dataset was then split into two subsets: a training set, which contained 90% of the labeled publications (11,779 entries), and a test set, which contained the remaining 10% (1,360 entries). We used the training subset to train a classification model, employing a stochastic gradient descent classifier9 from the scikit-learn Python package10. The model was evaluated on the test subset and achieved an accuracy of 89%.
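
To make the training step concrete, the sketch below fits a linear classifier by stochastic gradient descent on the hinge loss (which is also the default loss of scikit-learn's `SGDClassifier`). It is a dependency-free illustration and not the scikit-learn model used in the study; the bag-of-words features, the learning rate, the number of epochs and all function names are assumptions.

```python
import random
from collections import Counter


def featurize(text):
    """Bag-of-words counts for one metadata entry."""
    return Counter(text.lower().split())


def split_train_test(items, test_fraction=0.1, seed=0):
    """Shuffle and split labelled items into training and test subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def score(weights, bias, features):
    """Linear decision value for one feature dictionary."""
    return bias + sum(weights.get(tok, 0.0) * n for tok, n in features.items())


def train_sgd(examples, epochs=20, lr=1.0, seed=0):
    """Fit weights by SGD on the hinge loss; labels are +1 (relevant) or -1."""
    rng = random.Random(seed)
    weights, bias = {}, 0.0
    data = [(featurize(text), label) for text, label in examples]
    for _ in range(epochs):
        rng.shuffle(data)
        for features, label in data:
            # Update only when the example violates the unit margin.
            if label * score(weights, bias, features) < 1.0:
                for tok, n in features.items():
                    weights[tok] = weights.get(tok, 0.0) + lr * label * n
                bias += lr * label
    return weights, bias


def predict(model, text):
    """Return +1 for 'relevant' and -1 for 'irrelevant'."""
    weights, bias = model
    return 1 if score(weights, bias, featurize(text)) >= 0 else -1


def accuracy(model, examples):
    """Fraction of examples whose predicted label matches the given label."""
    hits = sum(predict(model, text) == label for text, label in examples)
    return hits / len(examples)
```

In use, the labelled entries would be passed through `split_train_test`, the model fitted on the training subset, and `accuracy` evaluated on the held-out subset only.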

After training, we applied this model to classify the entire corpus of “neutron” publications, which had been initially generated by a Scopus query based on the single term “neutron.” Using this supervised machine learning model allowed us to reduce the corpus size by removing entries classified as “irrelevant” to the neutron scattering community. Only publications deemed “relevant” by the model were included in the subsequent analysis.

Statistics

From the filtered database, the main publication trends were extracted, including the number of publications per year, the average number of (co-)authors and (co-)affiliations per publication per year, as well as the total number of (co-)authors and (co-)affiliations per year. For our analysis, we made no distinction between the main author, corresponding author or co-authors. Every person included in the metadata was thus considered an ‘author’ for the purposes of the subsequent analyses.
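
A minimal sketch of how such yearly statistics could be aggregated is given below. The record layout (dictionaries with `year` and `author_ids` keys) is hypothetical and only serves the illustration; affiliations could be handled in the same way.

```python
from collections import defaultdict


def yearly_statistics(publications):
    """Aggregate publication and author counts per publication year."""
    by_year = defaultdict(
        lambda: {"publications": 0, "author_slots": 0, "unique": set()}
    )
    for pub in publications:
        entry = by_year[pub["year"]]
        entry["publications"] += 1
        # Every listed person counts as an author, regardless of role.
        entry["author_slots"] += len(pub["author_ids"])
        entry["unique"].update(pub["author_ids"])
    return {
        year: {
            "publications": e["publications"],
            "mean_authors": e["author_slots"] / e["publications"],
            "unique_authors": len(e["unique"]),
        }
        for year, e in sorted(by_year.items())
    }
```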

NLP topic modelling – unsupervised machine learning

An unsupervised machine learning technique called Latent Dirichlet Allocation (LDA)11 was employed to identify common topics frequently occurring in the neutron publications. The number of topics used in the LDA analysis ranged from 2 to 50, and the optimal number of topics was determined based on the extent of topic separation and avoidance of excessive fragmentation.

For the analysis, the titles, keywords, and abstracts of the publications were combined into a single text entry. These combined text entries were then tokenized and filtered using the NLTK12 Python package. To identify a non-topic-specific vocabulary that appeared frequently across the entire corpus, an initial iteration of LDA was performed using a standard list of English stopwords and punctuation characters provided by the NLTK package. The identified vocabulary was then added to the list of stopwords for subsequent modeling. Further details can be found in Appendix 1.
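
The preprocessing described above can be illustrated with the following sketch. It does not use NLTK: the short stopword list, the regular-expression tokenizer and the function names are stand-ins chosen for the example, and the corpus-specific additions in the usage note are hypothetical.

```python
import re

# Tiny illustrative stopword list; the study used NLTK's English list instead.
BASE_STOPWORDS = {"the", "of", "a", "an", "and", "in", "by", "is", "was", "with", "to"}


def combine_fields(title, keywords, abstract):
    """Join the non-empty metadata fields into one text entry."""
    return " ".join(part for part in (title, keywords, abstract) if part)


def tokenize(text, stopwords):
    """Lower-case, keep alphabetic tokens only, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok not in stopwords]
```

Corpus-wide, non-topic-specific terms found in a first modelling pass would then simply be merged into the stopword set, for example `tokenize(text, BASE_STOPWORDS | {"neutron", "study"})`.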

All the tokenized and filtered text entries were transformed into vectors in an n-dimensional token space using the count vectorizer provided by the scikit-learn Python package. LDA was subsequently applied to these vectorized texts, and the resulting topics were visualized using the LDAvis13 package.
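
What a count vectorizer does to the tokenized entries can be sketched without any external package as follows. This only mirrors the idea behind the scikit-learn vectorizer; the function names are illustrative, and the dense list-of-lists output would be replaced by a sparse matrix for a corpus of realistic size.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Map each distinct token to a column index (sorted for reproducibility)."""
    tokens = sorted({tok for doc in tokenized_docs for tok in doc})
    return {tok: index for index, tok in enumerate(tokens)}


def count_vectors(tokenized_docs, vocabulary):
    """Turn each token list into a vector of token counts."""
    rows = []
    for doc in tokenized_docs:
        row = [0] * len(vocabulary)
        for tok, n in Counter(doc).items():
            if tok in vocabulary:  # tokens outside the vocabulary are ignored
                row[vocabulary[tok]] = n
        rows.append(row)
    return rows
```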

Based on the identified topics, each text entry was assigned a percentage score indicating its alignment with each of the identified topics.
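
As a small illustration of this scoring step, the sketch below converts the non-negative topic weights of one text entry into percentages and reports the best-aligned topic. The function names and the assumption that the weights come from the fitted topic model are ours.

```python
def topic_percentages(topic_weights):
    """Normalize non-negative topic weights so that they sum to 100."""
    total = sum(topic_weights)
    if total == 0:
        return [0.0] * len(topic_weights)
    return [100.0 * weight / total for weight in topic_weights]


def dominant_topic(topic_weights):
    """Return (topic index, percentage) of the best-aligned topic."""
    percentages = topic_percentages(topic_weights)
    best = max(range(len(percentages)), key=percentages.__getitem__)
    return best, percentages[best]
```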

Description of trends in the European neutron science community

The data of individual authors, including their affiliations, city, and country, were extracted from the publications database. To understand global trends in the number of authors involved in neutron research, the unique Scopus author IDs were classified into six world regions aiming to provide insights into the dynamics of the neutron community.

Furthermore, the authors were categorized into two groups: “new,” representing authors appearing in the database for the first time at a specified year, and “old,” representing authors who were already present in the database prior to that year.
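
The “new” versus “old” categorization can be expressed compactly, as in the sketch below. The input layout (a mapping from year to the set of author IDs publishing in that year) and the function name are assumptions made for the example.

```python
def new_and_old_authors(authors_by_year):
    """Count first-time ('new') and returning ('old') authors for each year."""
    seen = set()
    result = {}
    for year in sorted(authors_by_year):
        current = set(authors_by_year[year])
        new = current - seen  # IDs never encountered in an earlier year
        result[year] = {"new": len(new), "old": len(current) - len(new)}
        seen |= current
    return result
```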

To visualize the neutron community, the authors’ locations (country and city, when available) were geocoded using the OpenStreetMap14 Nominatim API. However, due to inconsistencies in the author data, such as missing or mismatched city and country information, an adapted strategy was employed. Authors without a specified country in the publications metadata were excluded from the visualization. For authors with a provided country, their longitude and latitude coordinates were initially determined based on the country. In cases where automatic geocoding failed, a manual inspection was conducted, and the names of corresponding territories were updated (e.g., replacing “Libyan Arab Jamahiriya” with “Libya” and “German Democratic Republic” with “Germany”). Subsequently, for authors with both city and country information available, their coordinates were adjusted to the specific city. All authors were then represented as author-density distribution as a “heatmap” on a world map using the folium15 Python library.
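
The fallback strategy described above can be sketched as follows. The `geocode` argument is a hypothetical caller-supplied function that maps a place name to coordinates or `None` (in practice a wrapper around the Nominatim API); only the two name replacements quoted in the text are included.

```python
# Historical names mapped to present-day ones, as in the examples in the text.
COUNTRY_FIXES = {
    "Libyan Arab Jamahiriya": "Libya",
    "German Democratic Republic": "Germany",
}


def locate_author(city, country, geocode):
    """Return (latitude, longitude) for an author, or None if not locatable."""
    if not country:
        return None  # authors without a country are excluded from the map
    country = COUNTRY_FIXES.get(country, country)
    coords = geocode(country)  # country-level position first
    if city:
        refined = geocode(f"{city}, {country}")
        if refined is not None:
            coords = refined  # refine to the city when it can be resolved
    return coords
```

Passing the lookup in as an argument keeps the strategy testable offline, for instance with a plain dictionary's `get` method.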

Results & discussion

Data collection

The publication metadata were collected from the Scopus database via the Scopus Search API with the help of the pybliometrics Python package8. An API search for the keyword “neutron” yielded 320,357 hits, covering the years 1922–2022. In order to have a reliable benchmark, we also searched the publication metadata for two large European neutron sources, namely the Institut Laue Langevin (ILL, Grenoble, France) and the ISIS Neutron and Muon Source (Rutherford Appleton Laboratory, Didcot, UK). The search yielded 15,898 entries in the period 1968–2022 for ILL and 9,149 entries in the period 1987–2022 for the ISIS Neutron and Muon Source. We note that considerable ambiguity exists in the links between publications and affiliations. For example, publications linked to the ISIS Neutron and Muon Source could list the Rutherford Appleton Laboratory (RAL) or the Science and Technology Facilities Council (STFC) among their affiliations instead of the ISIS Neutron and Muon Source. Therefore, we are convinced that not all the publications originating from the ISIS Neutron and Muon Source are present in our results, but we argue that our search delivered a representative fraction of them. The ILL, on the other hand, has no alternative names or IDs, and we therefore assume that our search results include nearly all the publications containing ILL in their affiliation lists. Figure 1 shows a Venn diagram of the publications found on Scopus by the aforementioned search queries. The majority of the publications originating from the ILL and ISIS neutron sources are present in the search result for the “neutron” keyword. Moreover, all of the publications originating from both affiliations simultaneously are captured in the search result for the “neutron” keyword.

Fig. 1

A Venn diagram of the publication entries found in Scopus. It shows that a search on the term “neutron” captures about 65% of the entries that have the ISIS Neutron and Muon Source or ILL in their lists of affiliation IDs.

Data filtering – supervised machine learning

The initial Scopus query based solely on the term “neutron” yielded a corpus much larger than the actual body of publications produced by the neutron scattering research community. Filtering out publications with terms such as “neutron star” naturally reduced the corpus size, but a more comprehensive filtering approach was necessary to avoid bias from overly restrictive criteria.

By employing the machine learning model trained on community-labeled data, we were able to classify publications based on the entire vocabulary of each entry’s metadata, rather than relying on a few specific terms. The accuracy of this filtering approach was evaluated based on the overlap between the model’s classifications and those provided by the neutron scientists. As shown in Table 1, the trained model achieved an accuracy of 89%, which we deemed sufficient for identifying publications relevant to the neutron scattering community.

Table 1 Supervised machine learning accuracy results on the test data subset of 1,360 publications. The algorithm identified 739 + 90 publications as ENSA-related, of which 739 were also identified as ENSA-related by the scientists. The combination of true/false positives and negatives yields an algorithm accuracy of 89%.
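
For reference, the quantities in the caption relate as in the worked expression below, using the standard definitions with TP, TN, FP and FN denoting true positives, true negatives, false positives and false negatives. Only TP = 739 and FP = 90 are stated in the caption, so only the precision can be evaluated from it; the accuracy additionally requires the TN and FN counts.

```latex
\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN},
\qquad
\text{precision} = \frac{TP}{TP + FP} = \frac{739}{739 + 90} \approx 0.89
```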

Another indication of the ‘accuracy’ of our approach is depicted in Fig. 2, where the publication output of the Nordic neutron scattering community is counted by a manual analysis method (reading all abstracts, Lefmann4) and overlaid with similar numbers generated from our analysis. The numbers for most countries are reproduced well, while publications from Sweden show a 20% discrepancy.

Applying this classification model to the entire corpus of publication metadata collected from Scopus (320,357 entries), selecting only the entries classified as “relevant” to the community, and restricting the publication years to 1930–2020, reduced the number of entries in our publication dataset from 320,357 to 121,731. Detailed descriptions of the vocabularies relevant to the topics modeled for the filtered-out publications and those kept for further analysis are provided in Appendices B and C, respectively.

Fig. 2

A comparison of numbers of neutron publications for Nordic and Baltic Neutron Scattering Communities (NBSC) based on a Nordic report using manual classification4 (markers, NBSC) and our database (lines, ENSA).

Statistics

In order to focus our analysis on neutron research involving the European neutron community, we split the filtered dataset into two subsets: one including at least one author with an affiliation in a European country (70,830 entries), and the other without any authors with European affiliations (50,901 entries).
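
A split of this kind can be sketched as below. The country set is a deliberately incomplete, illustrative subset, and the record layout (a `countries` list per publication) is hypothetical; a real analysis would need the full list of countries regarded as European.

```python
# Illustrative subset only; not the list of countries used in the study.
EUROPEAN_COUNTRIES = {"France", "Germany", "United Kingdom", "Italy", "Spain", "Sweden"}


def split_by_europe(publications, european=EUROPEAN_COUNTRIES):
    """Separate publications with at least one European affiliation from the rest."""
    with_european, without_european = [], []
    for pub in publications:
        countries = set(pub["countries"])
        if countries & european:
            with_european.append(pub)
        else:
            without_european.append(pub)
    return with_european, without_european
```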

The main data of our analysis are displayed in Fig. 3 and show that the number of publications of the European community increased rapidly during the last century and has remained rather steady over the last 20 years. In contrast, the non-European publications show a steadily increasing trend, with European researchers being involved in slightly over half of all publications world-wide. Strong variations (per year) occur in the decade around 2000, specifically in Europe. The reason for this is that many of the articles published over this period were associated with the biennial proceedings of the European Conference on Neutron Scattering (ECNS) and International Conference on Neutron Scattering (ICNS).

Fig. 3

The evolution of the number of ‘neutron publications’ per year, as derived from the filtered corpus of collected publications. More than half of the publications have at least one European author.

Over the same time span, the mean number of authors per publication has increased by a factor of 3.5 (Fig. 4). This is an indirect demonstration of how the community has evolved through the formation of larger, multidisciplinary teams that combine the use of neutrons with other scientific methods.

Fig. 4

Number of authors per publication per year, both for non-European and for European neutron publications.

NLP topic modelling – unsupervised machine learning

The analysis of the metadata vocabulary enables the classification of published works into topics using NLP. This topic modeling is unbiased regarding the text’s meaning, though the authors select the most concise topic names or labels. For instance, the most relevant terms for topic 10 include ‘propagation vector,’ ‘magnetic phase diagram,’ ‘external magnetic,’ ‘metamagnetic,’ ‘zero field,’ ‘noncollinear,’ and ‘spin ice.’ Consequently, the topic was labeled ‘Magnetism’ for conciseness. A complete description of the modeled topics with the most relevant and most frequent terms can be found in Appendix C.

Figure 5 shows how the relative output varies between the topics over the entire time span. In the late fifties the output was dominated by magnetism, excitations, macroscopic structures and chemical composition, whereas in the most recent 5 years neutron methods are applied rather evenly across all 10 topics.

Fig. 5

Relative topic distribution of the ‘neutron publications’ with at least one European author. The names of the 10 topics coarsely describe the scientific field. When the NLP modelling identified a topic predominantly on the vocabulary that directly related to a specific applied neutron method, the topic name includes the method in brackets: Neutron Powder Diffraction (NPD), Inelastic Neutron Scattering (INS), Small Angle Neutron Scattering (SANS), Quasi-elastic Neutron Scattering (QENS), Neutron Diffraction (ND), and Neutron Activation Analysis (NAA). Other methods are present, but less strongly related to a single topic.

Description of trends in European neutron science community

Through the affiliations assigned to the authors in the corpus, we were able to render the community distribution as geographical “heatmaps”, as shown in Fig. 6. Although only a few European neutron sources2,16 are (and were) operational, over the last decades the scientific community has become much more broadly distributed over European academia and industry. Because most neutron sources operate as international user facilities, even countries without a neutron source (e.g. Italy and Spain) form a considerable fraction of the community.

Fig. 6

The geographic distribution of the neutron science community depicted as heatmaps. (a) The heatmap for 1961–1970 is plotted (blue) above the (b) 2011–2020 community heatmap (red) to show how the community spread has evolved. [interactive version as supplementary information: Europe_timesplit_map.html ]. The heatmaps are not normalized to population density. Map generated using the Python Folium package.

The combination of NLP topic modelling and the affiliations associated with each publication allows us to map the ‘topic density’ as geographical heatmaps, as shown in Fig. 7. For this visualization, only publications that are at least 50% aligned with one of the modelled topics were selected and attributed to the respective set. There is thus no double counting across topics, as each publication appears in only one topic. However, each affiliation involved in a publication is assigned a unit value for that publication, so a publication can be counted multiple times within the same topic. The total number of publications originating from each geographical location is expressed as colour intensity in Fig. 7. The heatmaps show a rather homogeneous distribution of scientific topics over the continent, indicating that each nation has scientists carrying out research on all topics. This map could be of great help for scientists looking for collaborations within or outside of their field of research.
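
The counting rule described above can be sketched as follows. The record layout (`topic_percentages` and `affiliation_locations` per publication) and the function name are hypothetical; the 50% threshold and the unit value per affiliation follow the text.

```python
from collections import Counter


def topic_density(publications, threshold=50.0):
    """Count affiliations per (topic, location) for well-aligned publications."""
    density = Counter()
    for pub in publications:
        scores = pub["topic_percentages"]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] < threshold:
            continue  # not clearly aligned with any single topic
        # One unit per affiliation, so a publication may count several times.
        for location in pub["affiliation_locations"]:
            density[(best, location)] += 1
    return density
```

The resulting counts per location would then serve as the weights of the heatmap layer for each topic.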

Figure 8 shows the growth of the neutron community and the leading role of Europe, while Figs. 9 and 10 illustrate how a considerable fraction of this community is new to the field. Figure 10 demonstrates that about one third of the unique authors from Europe publish for the first time each year. This fraction indicates a large potential for growth of the European neutron research community. The growth of this fraction is related to the fact that more recent publications tend to have more authors (Fig. 4) and that visiting scientists at the neutron user facilities nowadays often include researchers with little or no previous experience of neutron scattering research. At the same time, this shows that the expertise in neutron scattering methods concentrated at European large-scale neutron facilities is crucial for the continued growth of the community.

Fig. 7

The geographical spread of publication density for 2 of the 10 identified scientific topics: (a) Scientific instrumentation, (b) Mesoscopic structures (SANS). No correction is made for population density variations over the continent. [interactive version as supplementary information: Europe_neutron_all_topics_map.html ].

Fig. 8

Trend and distribution of the authors involved in neutron publications around the world. The total number of unique authors around the world is 150,421.

Fig. 9

The community of neutron scientists, separated into scientists that appear for the first time in a neutron publication (‘new’) and those that appear for a second time, or more (‘old’).

Conclusions

Leveraging published research and open-source machine learning toolkits, we have developed a method for conducting bibliometric analyses based on Natural Language Processing (NLP) and supervised machine learning. While the primary focus of this study is on the neutron research community, the methodology presented is innovative and broadly applicable to various scientific fields. This approach, geared towards quantitative trend identification within scientific communities, was successfully tested on the international neutron scattering community, with a specific focus on the European landscape.

Fig. 10

The evolution of the number of European authors, indicating the number of authors that appear for the first time in the corpus, that year (‘new’). The total number of unique authors from Europe is 71,311.

In the European neutron scattering community, our method revealed noteworthy trends. Despite a reduction in neutron sources, the neutron community in Europe kept growing, and has maintained its publication output rate over the past two decades, emphasizing the enduring significance of the use of neutron methods in scientific research. The continuous rise in unique authors, particularly among newcomers, indicates sustained interest and positivity within the neutron research community. Furthermore, an even distribution of publications and authors across diverse scientific topics highlights the community’s interdisciplinary nature and collaborative ethos, positioning it favorably amid changes in the scientific landscape, including the integration of new sources like the European Spallation Source (ESS7), and compact sources17,18,19.

Although we were able to gain remarkable insights into the structure and dynamics of the European neutron scattering community using this novel approach, there are several limitations to our method as applied in this work. Firstly, we used only publication metadata in our analysis and not the full texts of the publications. Extending the analysis to include the complete texts of publications would allow for a deeper examination of correlations between specific neutron techniques, the materials typically studied with these techniques, and other non-neutron techniques involved in such investigations. Furthermore, in our approach, we included all authors of each publication as being involved in neutron research. However, there is a clear distinction between users active at the neutron facility and other contributors to the research. A more detailed analysis of the roles of individual authors could be achieved by incorporating the full texts of publications in the analysis. Moreover, there is significant interest in similar analyses from the synchrotron user community20,21. We are confident that our method could be beneficial to them as well, and through collaboration, we can further develop and refine this approach.

The insights gained by the application of our method extend beyond neutron scattering, presenting a versatile methodology applicable to various scientific communities, especially those reliant on Large Research Infrastructures (LRIs) such as Synchrotrons, Accelerators, and Telescopes. Individual researchers and groups can benefit by establishing collaborations across adjacent fields, optimizing experimental approaches. Research infrastructures can leverage the method to swiftly understand evolving community needs, aligning facilities with dynamic research requirements. Governmental and funding bodies can make informed decisions, identifying research directions aligned with global policies and initiatives. Additionally, the method facilitates the allocation of short-term funds to actively contributing research groups, underlining its potential as a valuable asset for diverse stakeholders shaping the scientific research landscape.