The overarching goal of research is to produce knowledge. This involves ensuring that the accumulated knowledge is applicable to various research contexts as well as real-world settings. Across academic disciplines, one key criterion for achieving this is maximising the replicability of research. The exact definition of replicability varies across contexts; here, we follow the FORRT Glossary of Open Science Terms and the Turing Way Community, and use a broad definition of replicability as reaching the same conclusions when repeating a study with the same methods but new data (Parsons et al. 2022; The Turing Way Community, 2022). Replicability is typically distinguished from reproducibility, that is, reaching the same results when repeating the analysis of a study with both the same methods and the same data. Crucially, while the exact definitions might differ slightly across disciplines, a lack of replicability, in its broad sense, has recently been identified for large sets of studies in psychology (Open Science Collaboration, 2015; Klein et al. 2022), medicine (Ioannidis, 2005), economics (Camerer et al. 2016), and the behavioural and social sciences more generally (Camerer et al. 2018). Efforts to increase replicability rates have recently been discussed at great length, with suggestions to increase transparency (e.g., Asendorpf et al. 2013), engage in pre-registration (e.g., Nosek et al. 2022), and apply more rigorous statistical methods (e.g., Simmons et al. 2021). At the same time, the in-depth examination of replicability has also put the credibility of science as a whole to the test, prompting calls for a “credibility revolution” (Angrist and Pischke, 2010; Korbmacher et al. 2023) as well as for more purposeful communication of scientific uncertainty to the public (e.g., Howell, 2020). Communication, both with the public and with fellow researchers, depends crucially on natural language, which is inherently ambiguous and multifaceted. We argue that the improper or negligent use of language can pose another major challenge to replicability, one that has so far received little attention in the discussions on replicability. Language plays a central role throughout the research process, from theory formulation, study design, and data collection, all the way to documentation and dissemination. As such, we call attention to its critical role and the ways in which it interacts with replicability.

Language as a medium of research

In research, language is the primary medium that conveys meaning to its users. Thus, its presence across the research process is ubiquitous. It is used, for example, to search or summarise existing literature, to define technical jargon accurately, to formulate research questions and hypotheses, and to communicate results and interpretations. However, natural language can be imprecise, ambiguous, and context-dependent (Leung et al. 2024), which can pose challenges to replicability. For example, ambiguous formulations of research hypotheses can affect replicability in two distinct ways (Scheel, 2022). First, ambiguity can lead to different interpretations of verbal elements within the hypothesis, resulting in different conceptions of how the hypothesis should be tested and how data should be interpreted to evaluate it. Two researchers testing supposedly the same hypothesis might thus end up with different results (e.g., Auspurg and Brüderl, 2021). Second, a vague hypothesis allows researchers more degrees of freedom when analysing and interpreting their data. Multi-lab studies have shown that these researcher degrees of freedom can lead to multiple possible analytical strategies, which often yield categorically different results (Silberzahn et al. 2018), blur the line between confirmatory and exploratory research, and might drastically inflate false positive rates (e.g., Simmons et al. 2011).
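
To illustrate how a vaguely worded hypothesis translates into analytic flexibility, consider a minimal Python sketch (all variable names and the simulated data are hypothetical): the verbal claim “higher X is associated with better Y” can defensibly be tested as a continuous correlation or as a median-split group comparison, and the two tests need not agree.

```python
# Hypothetical sketch: two defensible operationalisations of the "same"
# verbal hypothesis can lead to different statistical conclusions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
x = rng.normal(size=100)             # predictor (a vaguely defined construct)
y = 0.15 * x + rng.normal(size=100)  # weak true association plus noise

# Operationalisation A: treat X as continuous and test a linear correlation.
r, p_corr = stats.pearsonr(x, y)

# Operationalisation B: dichotomise X at the median and compare group means.
high, low = y[x > np.median(x)], y[x <= np.median(x)]
t, p_ttest = stats.ttest_ind(high, low)

print(f"Correlation test:  r = {r:.2f}, p = {p_corr:.3f}")
print(f"Median-split test: t = {t:.2f}, p = {p_ttest:.3f}")
```

Because the hypothesis does not dictate which operationalisation is correct, both analyses can be presented as tests of it, which is exactly the flexibility that inflates false positive rates.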

The intrinsic imprecision of language is exacerbated in academic communication contexts where language users differ in their linguistic and cultural backgrounds (Vander Beken et al. 2020). Regardless of the language employed for scientific discourse, readers from different backgrounds will have varying degrees of proficiency in that target language, will stand at different distances from its cultural references and conventions, and will interpret technical jargon through the lens of their own academic training. In light of these considerations, the intrinsically imprecise nature of language as a medium for scientific communication might constitute a crucial factor in the observed low replicability rates.

Language as a tool for research

Language is also a central part of the research toolkit. It is integral to designing a study, preparing materials, and collecting and analysing data. For example, in the social and behavioural sciences, language is necessary for designing survey items and interview questions, for presenting experimental stimuli, and for delivering instructions to participants and informants. Language is also used to provide instructions among researchers (or research assistants), such as for experimental procedures, protocols, or data processing and documentation. When replicating a given study in a different language, the researcher must ensure that the translated materials are easy to understand, while ascertaining that the translated texts still capture the meanings intended in the original study’s measures.

This is especially important for measurements in humans, where the translations are expected to measure the exact same constructs as the original tests. Yet this is often challenging in cross-linguistic studies, as the test materials might have been conscientiously translated but not tested for measurement invariance (Klein et al. 2018; Luong and Flake, 2023). The linguistic properties of the translated materials might differ from the originals, which might affect participants’ understanding of the meaning of test items or instructions. This leads to measurement non-invariance, meaning that the psychometric properties of items or questions are not equivalent and the results are therefore not comparable. For closely related languages and cultures (e.g., Dutch, German, and English), examining and comparing the measurement properties of the tools in quantitative analyses may be sufficient; however, additional considerations are needed when conducting research in dissimilar languages and cultures. This requires a qualitative examination of aspects of the tool that may be perceived differently in the context and language of interest, or that may not apply at all. For example, participants across different populations might vary in their comprehension of spoken, written, or signed messages depending on cultural, educational, or clinical backgrounds. Moreover, direct translation may have different implications across cultures: cultural expectations about what is taboo might rule out some types of stimuli; for example, it is inappropriate to show participants alcohol-related words in Arabic-speaking countries. This example shows that test materials with potentially taboo contents cannot be directly translated and applied in studies across cultures, limiting the breadth of cross-cultural and cross-linguistic replications. This is especially relevant for research where language is an object of study (see next section), but it also affects other types of research.

These challenges are likely to grow with the cultural and linguistic distance between a population of interest and the WEIRD populations on which much of the social and behavioural sciences are based (Henrich et al. 2010; Blasi et al. 2022), as the linguistic and cultural differences inherent in translated materials might prevent replication studies across languages from obtaining comparable results.

While we have considered natural languages so far, some considerations also extend to programming languages. In most scientific disciplines, the use of computer code is common for creating research software and for collecting, processing, or analysing data. While there are several structural differences between natural and programming languages regarding replication, many of the issues described for human languages also apply to the use of programming languages. To replicate a study, the replicator needs to be able to reproduce and, hence, understand each relevant decision made for the original study. If computer code is used to collect, process, or analyse data, other researchers must be able to comprehend what was done in order to use this information for their replication work. This requires ensuring that the code is accessible and well-documented for the use of other researchers. Investigations into the computational reproducibility of research have revealed that, oftentimes, even the requirements for reproducing results using the same code are not met, because the code is either not shared or not properly documented (Perkel, 2020; Krähmer et al. 2023).
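
As a concrete, minimal illustration of one such documentation practice, the following Python sketch records the exact software environment alongside the analysis output; the package names and output file are illustrative assumptions, not a prescribed standard.

```python
# Minimal sketch: write an environment snapshot next to the results so that
# replicators can reconstruct the software setup used for the analysis.
import json
import platform
import sys
from importlib.metadata import PackageNotFoundError, version

def environment_snapshot(packages):
    """Collect interpreter, platform, and package versions."""
    snapshot = {"python": sys.version, "platform": platform.platform(), "packages": {}}
    for name in packages:
        try:
            snapshot["packages"][name] = version(name)
        except PackageNotFoundError:
            snapshot["packages"][name] = "not installed"
    return snapshot

with open("environment.json", "w") as f:
    json.dump(environment_snapshot(["numpy", "pandas", "scipy"]), f, indent=2)
```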

Notably, different researchers also use different tool stacks, including different programming languages. As with human languages, translations are possible, but they are often associated with some degree of conversion loss. A problem that is unique to the realm of programming languages is the use of proprietary solutions that not every researcher has access to, a limitation that affects some researchers far more than others. Overall, for both human and programming languages, it is clear that language as a tool for research can, in several ways, introduce difficulties in replications due to challenges related to translating, conveying, and preserving meaning.

Language as an object of study

In several disciplines, language itself, or the role of language in cognition, society, and culture, is an object of scientific inquiry. When language is the object of study, its role in replicability takes on another, more theoretically relevant dimension compared to the issues with language as a medium or tool of research discussed above. Specifically, low replicability when language is an object of study may reflect a lack of generalisability across languages rather than methodological artefacts.

Replication in cross-linguistic research

The role of language is ubiquitous in everyday life; thus, it is important to understand aspects such as language development and the use of language in documenting cultural knowledge. Language is a culturally evolved, complex adaptive system (Winter, 2014) that interacts with a large variety of human experiences. The structure of languages can differ substantially (Evans and Levinson, 2009), and these differences may affect other parts of cognition, such as working memory (Amici et al. 2019), attention (Wang, 2021), and perception (Kemmerer, 2023). In light of linguistic diversity and its complex interaction with cognition and behaviour, the question arises as to whether we should always expect findings to be replicable when a study is conducted in a different language or culture. Does a failed replication across languages suggest non-replicability, non-generalisability, or merely that the phenomenon in one language cannot be investigated with the same study design, measurement, sample selection, or materials in another language? For example, some research suggests that the processing and acquisition of nouns might differ from verbs (e.g., Cazden, 1968; Maratsos, 2013). Replicating such an asymmetry across different languages can be challenging or even impossible, because the distributional, semantic, and morphological properties of categories such as nouns and verbs can differ drastically across languages, with some languages having been described as lacking such a distinction altogether (Sasse, 2001). Covariates related to culture and social factors that are intricately connected to the language we speak may render a failed replication uninterpretable: it becomes unclear whether a failed replication constitutes evidence against the original finding or a limitation of the context in which this finding can be obtained (Grieve, 2021; Roettger, 2021a). For example, if an intervention that improves reading skills in English-speaking children with developmental dyslexia does not work in German, this may be due to the non-replicability of the original English study, but it may also be that characteristics of the German language, such as its morphological complexity or the closer correspondence between print and speech sounds, render the intervention ineffective. Thus, without conducting further research, it is difficult to draw conclusions from such a failure to replicate.

Language as data

Challenges regarding replicability go beyond questions of translation when language is an object of study. These are relevant not only for replicability in the context of cross-linguistic research but also when research is replicated or reproduced within a language. In areas such as communication sciences and linguistics, for example, audio or video recordings of language production, news articles, social media posts, or podcasts may be used as data sources. Depending on the source and type of data being used, research on language as an object of study often requires preprocessing steps, which can be complex and resource-intensive (in terms of time and/or required computing resources). Typical preprocessing steps include the transcription of audio material into text, manual or (semi-)automated coding or classification of content, and the application of natural language processing (NLP) pipelines, such as for part-of-speech (POS) tagging or named entity recognition (NER). Such pipelines are mostly developed for the most-studied languages; as such, resources for under-studied languages may not exist or may be of lower quality, as less data is available for their development (e.g., Chilson et al. 2024).
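
For illustration, the following minimal Python sketch runs two of these typical preprocessing steps, POS tagging and NER, using the open-source spaCy library; the pretrained English model assumed here (en_core_web_sm) has high-quality counterparts mainly for well-resourced languages, which is precisely the gap noted above.

```python
# Minimal sketch of typical NLP preprocessing steps with spaCy.
# The English model must be installed first, e.g.:
#   python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The Open Science Collaboration published its report in 2015.")

# Part-of-speech tag for each token in the input text.
print([(token.text, token.pos_) for token in doc])

# Named entities recognised by the pretrained pipeline.
print([(ent.text, ent.label_) for ent in doc.ents])
```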

Similar to data analysis, the preprocessing of linguistic data entails various researcher degrees of freedom, which can be particularly impactful for the complex processing pipelines that are common in text-as-data and, even more so, audio- and video-as-data studies (Coretta et al. 2023; Lukito et al. 2024). Of course, the issue of translation and potential conversion loss, akin to the challenges faced when language is used as a tool in research, also warrants consideration in this context. Generally, if we rely on specific tools or tool chains in language-as-data settings, we need to document them properly. The methods used for processing and analysing data are hence especially important. Besides documentation, an important step that is often somewhat neglected in research with text as data is the validation of methods (Birkenmaier et al. 2023). For instance, during preprocessing procedures such as POS tagging, validation may involve systematically comparing the POS tags generated by the NLP pipeline to a manually annotated “gold standard” dataset to quantitatively assess accuracy. Another form of validation in a language-as-data setting is a review by linguistic experts, who examine a representative sample of the text data to ensure that the automated tagging generated via NLP aligns with expert human annotations. When working with under-studied languages, researchers might need to validate the NLP pipeline by checking for biases or errors unique to that language. This could involve running an error analysis on the output to identify common misclassifications and refining the pipeline accordingly. The validation of computational text analysis methodologies has become increasingly critical with the proliferation of artificial intelligence (AI) tools, particularly large language models (LLMs), as the reliability and validity of annotations or classifications generated by these technologies have already been shown to present significant challenges (Kristensen-McLachlan et al. 2023; Pangakis et al. 2023; Reiss, 2023).
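
A minimal sketch of such a validation step, assuming invented tag sequences for illustration (in practice they would come from the NLP pipeline and from human annotators), could look as follows:

```python
# Compare automated POS tags against a manually annotated "gold standard".
from sklearn.metrics import accuracy_score, cohen_kappa_score

gold_tags = ["DET", "NOUN", "VERB", "DET", "ADJ", "NOUN"]   # human annotation
auto_tags = ["DET", "NOUN", "VERB", "DET", "NOUN", "NOUN"]  # pipeline output

print(f"Accuracy: {accuracy_score(gold_tags, auto_tags):.2f}")
print(f"Cohen's kappa: {cohen_kappa_score(gold_tags, auto_tags):.2f}")

# Listing disagreements supports the error analysis recommended above for
# under-studied languages: which tags are systematically misclassified?
errors = [(g, a) for g, a in zip(gold_tags, auto_tags) if g != a]
print("Misclassifications (gold, auto):", errors)
```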

For research using language as data, however, issues related to replicability are not limited to data processing and analysis. Another domain that can produce replicability challenges is data access. Commonly used data sources, such as audio or video recordings, news texts, and social media content, are often proprietary and controlled by (commercial) third parties, such as media organisations and online platforms. A prevalent issue in work with textual, image, audio, and video data in many fields across the social and behavioural sciences as well as the humanities is the change or closure of application programming interfaces (APIs) through which researchers access platform data. Besides being potentially proprietary, video, audio, and text data are often also sensitive or involve culturally inappropriate content. These two attributes introduce legal and ethical concerns regarding copyright and intellectual property and can affect research replicability, especially when it comes to data sharing. For the specific case of data from social media, Davidson et al. (2023) have recently argued that “[…] platform-controlled social media APIs threaten open science […]”, and studies by Küpfer (2024) and Knöpfle and Schatto-Eckrodt (2024) have demonstrated that the replicability of studies based on data from Twitter is strongly and negatively affected by changes in the platform APIs and by restrictions imposed on data sharing in its Terms of Service (ToS).

Recommendations and ways forward

Based on the considerations in the previous sections, we put forth recommendations for addressing challenges to replicability related to language as a) a medium of research, b) a tool for research, and c) an object of study.

Community-driven refinement of term definitions for clearer conceptualisation

When language serves as a medium of research, jargon is unavoidable for conveying complex ideas. Technical terms, theoretical descriptions, and research questions must be as precise as possible to communicate effectively. To improve replicability, the first step is identifying terms that are inherently ambiguous or lack consensual definitions in the literature (Leung et al. 2024). Such terms tend to be more challenging to operationalise, which may lead to differences in measurement across studies and subsequently affect the replicability of results. Reaching a broader consensus on the interpretation of technical terminology can support a more structured approach to theory formulation. This would require a collective effort within scientific fields and communities to define and iteratively refine consensual scientific term definitions over time (Leising et al. 2024; see also Parsons et al. 2022 for a successful crowd-sourced glossary of term definitions).

Formalisation of research questions and hypotheses for effective communication

After examining, defining, and agreeing on the specific attributes of the concepts involved, researchers can create stronger connections between empirical evidence and theoretical predictions (e.g., Scheel, 2022). For example, using transparent and formalised formats to pose specific, machine-readable research questions and hypotheses could help increase the falsifiability of hypothesis tests (e.g., Lakens and DeBruine, 2021). Such hypothesis specifications capture not only the conceptual descriptions of our predictions but also the operationalisation and the statistical predictions of the empirical tests. This avoids relying on solely verbal descriptions of hypotheses, thus reducing the degrees of freedom between the conceptual descriptions and the operationalisation or statistical predictions.
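
What such a machine-readable specification could look like is sketched below in Python; the schema and field names are our own illustrative assumptions, loosely inspired by the formalised formats discussed by Lakens and DeBruine (2021), not an established standard.

```python
# Hypothetical machine-readable hypothesis specification: every field that a
# replicator needs is explicit rather than buried in verbal descriptions.
import json

hypothesis = {
    "id": "H1",
    "conceptual": "Bilingual speakers switch languages faster after practice.",
    "operationalisation": {
        "iv": "practice condition (trained vs untrained)",
        "dv": "language-switch cost in ms (switch minus repeat trials)",
        "population": "adult bilingual speakers",
    },
    "statistical_prediction": {
        "test": "two-sided Welch t-test on switch costs",
        "decision_criterion": "p < 0.005",
        "smallest_effect_of_interest": "d = 0.2",
    },
}

print(json.dumps(hypothesis, indent=2))
```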

Increasing linguistic precision may, to a certain extent, rely on the use of statistical and mathematical expressions to capture a prediction. However, this may sacrifice the ease of communicating scientific results across disciplines, as well as to the public, when technical expressions rather than lay language are used to disseminate scientific information (Bullock et al. 2019). Switching between a statistics-oriented scientific language system and a lay language system to communicate research findings may hinder knowledge transfer and communication between researchers and the public. Both language systems are, however, equally important as media of research and can co-exist to serve different audiences, be it researchers in other fields or the general public. Hence, we call for higher education to strengthen proficiency in both formal scientific language and public science communication. Researchers would then be better equipped to communicate science to peer researchers using formalised scientific language while disseminating information to the public in non-technical language.

Material and data sharing for comparable replications across communities

When language is a tool for research, future work could focus on making high-quality resources available for under-studied languages in order to increase the comparability of replications across different languages. This involves developing and evaluating the quality, equivalence, and applicability of research tools in different languages and generating language-specific instruments when a direct transfer to a different linguistic and cultural context is not possible. To achieve this, the scientific community should strive towards openness, not only by sharing existing instruments but also by documenting and sharing the steps taken in their development. The measurement invariance of these tools across languages is the critical methodological issue to be addressed (e.g., Meredith, 1993). As a first step, researchers should consider whether it is appropriate to apply the same tool in another language. This step requires close collaboration with researchers who are deeply familiar with the target culture, ideally researchers who grew up in it and speak the language(s) fluently. Active exchange with the community will allow researchers to take cultural and linguistic differences into account appropriately.

Development of invariant measurement for cross-linguistic replications

As a second step, researchers should conduct quantitative analyses to test for measurement invariance. Such analyses are based on multi-group confirmatory factor analytic methods (see Hildebrandt et al. 2016, for details and extensions to nonlinear approaches), which, through parameter restrictions, allow for testing the equivalence of item difficulty, discriminative power, and item reliability within a measurement tool across languages. Combined, these qualitative and quantitative steps will allow researchers globally to create and adjust tools that can be used in their own languages and thus potentially contribute to reducing the WEIRD problem in research with human participants.
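
As a minimal sketch of a first, configural step in this direction, the following Python code fits the same one-factor model separately per language group, assuming the open-source semopy library for structural equation modelling; the data file, column names, and grouping variable are illustrative assumptions. Full metric and scalar invariance testing additionally imposes cross-group equality constraints on loadings and intercepts, which is typically done with dedicated multi-group SEM routines.

```python
# Configural check sketch: fit a one-factor CFA per language group and
# inspect whether the measurement model holds in each group.
import pandas as pd
from semopy import Model

data = pd.read_csv("survey_data.csv")  # hypothetical items plus 'language'
model_desc = "construct =~ item1 + item2 + item3"

for language, group in data.groupby("language"):
    model = Model(model_desc)
    model.fit(group[["item1", "item2", "item3"]])
    print(f"Parameter estimates for {language}:")
    print(model.inspect())  # factor loadings, variances, etc.
```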

Promotion of reusable and interoperable use of programming languages

With regard to programming languages, we urge the adoption and promotion of practices that increase reusability and interoperability, such as proper documentation (e.g., via annotating research material and code and through version control of all software and research-related tools), as well as avoiding proprietary closed-source solutions. In addition, we emphasise the recommendations of previous scholars to rely on free and open-source tools for scientific research (e.g., Asendorpf et al. 2013).

Final remarks: Big Team Science drives open science and large-scale replications

Language as an object of scientific inquiry warrants both strong quantitative and mechanistic theories on how language, behaviour, and cognition interact in general, and how language-specific traits moderate these interactions. Without such efforts, the field lacks a principled way of integrating empirical findings and, ultimately, advancing our understanding of human language and related areas in an effective manner (Roettger, 2021b). If scientists fail to (or cannot) specify the contexts where a given effect is replicable, and if they dismiss failed replications due to context sensitivity, scientific progress is seriously impeded (Simmons et al. 2011). Theory building and data collection form a closed loop; as such, large-scale replication efforts should be conducted involving researchers dispersed across geographic locations, languages, and cultures. For example, the recently launched ManyLanguages consortium (many-languages.com) aims to directly replicate experimental findings related to language sciences across many languages (Faytak et al. 2024).

This recommendation feeds into our next suggestion, which is relevant whenever language is used as data: as discussed above for language as a tool, we need to develop tools for processing and analysing language data (text and audio) in multiple languages. Notably, this cannot be done without a large-scale initiative to produce sufficient and accessible written materials in each language for the continuous development of these study resources across contexts in the first place. In addition, when processing and analysing language data, open-source tools should be preferred for reasons of transparency and accessibility, and all steps in the pipeline should be properly documented and explained. The importance of documentation and of open-source solutions also extends to the use of programming languages for research in the social and behavioural sciences in general.

We encourage researchers to attempt replications across different countries and languages, even when language is not the primary focus of the study. While linguistic and cultural variations introduce complexities, they should not obstruct cross-cultural replication efforts. Instead, we suggest that researchers aim to account for context-specific factors that may affect the generalisability of their findings and to provide clear, comprehensive documentation of methodologies and potentially relevant contextual variables. Collaborative efforts across diverse cultural and linguistic contexts are essential for enhancing the robustness of research and an important step towards improving the generalisability of scientific findings.