Abstract
This study investigates whether cohesive and coherent patterns differ across human-translated, machine-translated and non-translated English texts, and whether these patterns remain consistent across four distinct registers. Drawing on five categories of metrics from Coh-Metrix 3.0, namely referential cohesion, personal pronouns, connectives, latent semantic analysis and situation model, the analysis employs principal component analysis, flexible discriminant analysis and Permutational Multivariate Analysis of Variance to triangulate results. The findings reveal that: (i) academic texts exhibit significantly higher levels of cohesion and coherence than other registers, particularly in coreference, semantic similarity, logical connectivity and intentionality, whereas fictional texts, shaped by story-telling conventions, tend to create cohesive chains through anaphoric reference to maintain narrative fluidity and character interaction; (ii) both human and machine translations show a general tendency toward explicitation in comparison to non-translated texts, although this trend is not consistent across all cohesive and coherent dimensions or registers; and (iii) register variation exerts a stronger influence on cohesive and coherent patterns than translation variety. These results underscore the importance of adopting nuanced, context-sensitive approaches to studying translated language and of situating such inquiries within both technological and functional frameworks. Moreover, the observed patterns provide evidence for the hypothesis of risk aversion, suggesting that human translators often adopt risk-averse strategies to reduce potential misunderstandings, while the explicitation in machine translations may reflect underlying algorithmic biases. 
Taken together, these findings contribute theoretical, methodological and practical insights to the ongoing investigation of translation universals and the evolving role of machine translation in translation practice and pedagogy.
Introduction
Identifying systematic patterns that distinguish translated texts from non-translated ones has long been a central topic in corpus-based translation studies. This line of enquiry has been conceptually underpinned by the notion of ‘translation universals’ (Baker, 1993), offering valuable insights into the cognitive, linguistic and social dynamics that shape translated language. According to Baker’s (1993, pp. 243–246) definition, translation universals are features “which typically occur in translated text rather than original utterances and which are not the result of interference from specific linguistic systems”. To date, a number of such universal features have been proposed, supported or contested, including explicitation (Blum-Kulka, 1986; Kajzer-Wietrzny, 2015; Klaudy and Károly, 2005; Marco, 2012, 2018; Murtisari, 2016; Zhang et al., 2020; Zufferey and Cartoni, 2014), simplification (Kajzer-Wietrzny, 2015; Laviosa, 2002; Liu and Afzaal, 2021; Liu et al., 2023; Niu and Jiang, 2024; Xiao, 2010), normalisation (Delaere et al., 2012; Delaere and De Sutter, 2013; Lapshinova-Koltunski, 2015, 2022; Redelinghuys and Kruger, 2015), shining-through (Cappelle and Loock, 2013, 2017; Evert and Neumann, 2017; Lapshinova-Koltunski, 2022; Teich, 2003; Xiao, 2010) and unique item (Kujamäki, 2004; Rabadán et al., 2009; Tirkkonen-Condit, 2004).
Among these, explicitation is arguably the most widely investigated and consistently reported phenomenon (Chesterman, 2011). Blum-Kulka (1986) defined explicitation as the tendency to use explicit cohesive devices even when unnecessary, and Baker (1996, p. 180) contended that translated texts are inclined to “spell things out rather than leave them implicit”. As a result, translated texts may display higher levels of textual cohesion, making them statistically distinguishable from non-translated texts on the basis of features related to cohesion and coherence (Øverås, 1998). Despite sustained scholarly interest, several issues remain in the research on explicitation, especially in terms of scope and methodology.
First, most existing studies have concentrated on explicitation in human translation, whereas this tendency has been far less explored in machine translation, which has only recently begun to attract scholarly scrutiny. Lapshinova-Koltunski (2015) compared the occurrence of conjunctions and the proportion of pronominal phrases and general nouns across phrase-based and statistical machine translation and human translation, finding that human translation is characterised by less connectivity in coherence than machine translation. In contrast, Krüger (2020) focused on three types of explicitation shifts, namely lexical insertion, lexical specification and relationship specification, in human and DeepL-translated texts, revealing more explicitation in human translation across all three categories. Jiang and Niu (2022) analysed discourse coherence in essays translated by Google and DeepL by comparing connectives, latent semantic analysis and situation-model indices, finding that human and machine translations both use more connectives than original texts, but that human translation makes greater use of deep cohesion than machine translation. These studies collectively point to the distinctive character of machine translation as a translation variety, underscoring the need for more targeted and systematic exploration.
Second, research on explicitation has often overlooked register variation, limiting the generalisability of its conclusions. For instance, Marco (2018) analysed the occurrence of connectives in translated and non-translated Catalan literary texts, finding no significant differences in overall frequency. Similarly, Song (2022) examined the use of connectives in the translated Chinese version of The Lord of the Rings and found that explicitation is present in both texts. Zhang et al. (2020), on the other hand, investigated the frequency of personal pronouns in Chinese children’s literature translated from English, demonstrating that personal pronouns are used more frequently in translated texts than in original children’s books. Nevertheless, these studies focused on literary texts, and their results may not generalise to non-literary texts, making it difficult to determine whether the observed features are register-specific or translation-specific. As De Sutter and Lefer (2020, p. 6) and Evert and Neumann (2017, p. 50) argued, failing to account for register can obscure our understanding of translational phenomena; register should instead be treated as an integral factor shaping the linguistic characteristics of translations (Kruger, 2019).
Third, a large proportion of earlier research adopted a univariate approach, examining linguistic features in isolation. While such studies have yielded important insights, they often lead to fragmented or contradictory conclusions. For instance, comparing two indices of lexical density, Xiao (2010) found that translated Chinese differs significantly from non-translated Chinese in the ratio of content words to the total number of words, but not when measured by the standard type/token ratio. The characteristics of translated texts, however, cannot be established through the observation of a single feature. Rather, the tendencies in translated texts, whether simplification, explicitation, normalisation or others, are likely to be expressed through combinations of features, in a way similar to genetic information. Evert and Neumann (2017) likewise noted that the interactions of different factors have rarely been examined and recommended that multivariate techniques be adopted to examine the systematic and structural properties of translated texts. It therefore remains unclear, on the basis of univariate analysis alone, how the linguistic properties of translations are influenced by spatial, temporal, technological, cognitive and many other factors (De Sutter and Lefer, 2020).
Against this backdrop, the present study aims to investigate how translation variety and register divergence jointly shape patterns of cohesion and coherence in English texts. To this end, three multivariate techniques, including principal component analysis (PCA), flexible discriminant analysis (FDA) and Permutational Multivariate Analysis of Variance (PERMANOVA), are employed to capture the multidimensional nature of linguistic variation. The study is guided by the following three research questions: (i) Are translated and non-translated texts characterised by different patterns in cohesion and coherence? (ii) Are these characteristics consistently observable across different registers? (iii) Are these differences more strongly influenced by translation variety, register variation or their interaction?
By addressing these questions, the study makes several theoretical, methodological and practical contributions to the research on translation universals and machine translation. Theoretically, it advances our understanding of explicitation by situating it within both technological and functional contexts. Methodologically, it demonstrates the value of multivariate analysis in revealing latent textual structures that are often missed by univariate methods. Practically, it offers insights into how machine translation aligns with or diverges from human translation norms, thereby informing both translator training and machine translation system development. In an era where machine-generated texts increasingly permeate professional and everyday communication, understanding the cohesion and coherence profiles of these outputs is crucial. This study thus not only interrogates generalised claims about translation universals but also calls for more nuanced, context-sensitive approaches to characterising translated language.
Theoretical underpinnings
This section offers an overview of fundamental concepts in this work. Section “Cohesion and coherence” introduces the two concepts of cohesion and coherence, while section “The concept of explicitation revisited” critically examines explicitation as a frequently explored linguistic characteristic in research on translation universals. Section “The hypothesis of risk-aversion and algorithmic bias” discusses two theoretical models explaining the explicitation tendency.
Cohesion and coherence
Cohesion and coherence are two notions pertaining to the connectedness of a discourse, whether spoken or written, but they differ in certain respects (Bublitz, 2011, p. 37). Halliday and Hasan (1976, p. 4) defined cohesion as the semantic relations that link the meanings of items within a text and create context; that is, semantic relations between an item and preceding or following items are established through lexis or grammatical structure. In contrast, coherence is believed to be “a cognitive category that depends on the language user’s interpretation and is not an invariant property of discourse or text” (Bublitz, 2011, p. 38). In other words, coherence can be understood as a construct reflecting how well the receiver comprehends a text, and it needs to be assessed by asking readers questions and evaluating how much information they obtain from the text (McNamara et al., 2014). Owing to its subjective and often intangible nature, coherence remains an underexplored concept, marked by complexity and ambiguity (Sinclair, 1991, p. 102). The distinction, then, lies in the fact that while cohesion is a surface-level feature that can be observed and measured in the discourse itself, coherence is a mental construct that resides in the mind of the reader or listener (Carrell, 1982; Givon, 1995; Graesser et al., 2004).
Halliday and Hasan (1976) proposed a taxonomy of cohesive devices, comprising five principal categories: reference, conjunction, substitution, ellipsis and lexical cohesion. To be more specific, reference is often situational and relies on linguistic cues to help readers link propositions, clauses or sentences within their mental representation of the text (Halliday and Hasan, 1976; McNamara and Kintsch, 1996). For instance, in Example (1), the personal pronoun he refers to Mahmoud el Zaki, creating a clear referential tie between the two clauses. Similarly, in Example (2), it refers back to the song, maintaining textual continuity. Conjunctions can be categorised based on the relationships they signal, including additive (and, furthermore, in addition, etc.), adversative/contrastive (however, but, in contrast, etc.), causal (because, so, therefore, etc.) and temporal (then, next, finally, etc.). By way of illustration, Example (2) indicates the use of because to link two clauses and explain the rationale behind the song selection. Substitution involves replacing an element with another, often realised through noun phrases (e.g. one), verb phrases (e.g. did/do) and clauses (e.g. so). For instance, in Example (2), one substitutes for the song, and do replaces the verb sing, and in Example (3), so is used to replace the sentence that I would call that a whirlwind. Ellipsis, often regarded as zero substitution, entails the omission of an element that is recoverable from context. In Example (4), women is omitted after two more, and the reader is expected to infer it from the earlier text. Finally, lexical cohesion is based on the identity or semantic similarity of reference across items, realised through repetition, synonyms, superordinates, general terms or collocations. Example (5) illustrates this with repeated items such as television, reading and books, and semantically related words like event and activity.
(1) His name was Mahmoud el Zaki and he was one of the Parquet’s rising stars. (FLOB, L09)

(2) I was prepared to do this song because it is one that I like. (FLOB, E35)

(3) “Would you call that a whirlwind? I don’t think so. I think by this age I know what I want”. (FLOB, A10)

(4) Smith, of Marion Road, Charlton, was originally charged with sex attacks on eight women and robbing two more, from 1988 to 1990. (FLOB, A13)

(5) It is rare to find parents and educators actively promoting a television series (other than the specifically didactic ‘schools’ broadcasts) and treating it as a cultural event. This reflects a deeply rooted ambivalence about television as entertainment, which is directly linked to attitudes surrounding children’s reading. Watching television is inevitably regarded as an activity less worthwhile than reading, and for long has been accused of seducing children away from books. (FLOB, G40)

(6) During the battle, it was Templars who directed the devastating arrow power that broke the Scottish spear schiltroms, and it was Templar Knights who led the final cavalry charge that destroyed Wallace’s army. (FLOB, N25)
McNamara et al. (2014, p. 63) further elaborated on lexical cohesion by identifying five principal forms of lexical referential overlap: nouns, pronouns, arguments, stems and content words. Noun overlap occurs when the same nouns are repeated across sentences, reinforcing topical continuity. Pronoun overlap involves the consistent use of pronouns with matching gender and number, ensuring referential clarity. Argument overlap encompasses two scenarios: the repetition of a noun in singular or plural form across sentences or the use of matching personal pronouns to maintain reference to the same entity. Stem overlap captures shared lemmas across varied grammatical forms, reflecting semantic continuity even when word forms differ. Finally, content word overlap measures the proportion of shared content words between sentence pairs, highlighting lexical consistency.
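The overlap measures described above can be approximated computationally. The following sketch illustrates the idea behind adjacent-sentence content word overlap; it is not the Coh-Metrix implementation (which relies on POS tagging and lemmatisation to distinguish noun, argument and stem overlap), and the stopword list and function names here are illustrative only.

```python
# Illustrative sketch of adjacent-sentence lexical overlap, loosely modelled
# on the Coh-Metrix overlap indices. Real noun and stem overlap require POS
# tagging and lemmatisation; content-word overlap is approximated here with
# a small, illustrative stopword list.

STOPWORDS = {"the", "a", "an", "and", "or", "but", "is", "was", "it",
             "of", "to", "in", "that", "he", "she", "they", "this", "has"}

def tokens(sentence):
    return [w.strip(".,;:\"'!?()").lower() for w in sentence.split()]

def content_words(sentence):
    return {w for w in tokens(sentence) if w and w not in STOPWORDS}

def content_word_overlap(s1, s2):
    """Proportion of shared content words between a sentence pair
    (analogous in spirit to content word overlap for adjacent sentences)."""
    a, b = content_words(s1), content_words(s2)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def adjacent_overlap(sentences):
    """Mean overlap across all adjacent sentence pairs in a text."""
    pairs = list(zip(sentences, sentences[1:]))
    if not pairs:
        return 0.0
    return sum(content_word_overlap(s1, s2) for s1, s2 in pairs) / len(pairs)

text = [
    "Watching television is regarded as an activity less worthwhile than reading.",
    "Television has long been accused of seducing children away from books.",
]
print(adjacent_overlap(text))  # nonzero: 'television' is shared
```

In this toy pair, adapted from Example (5), only television recurs, so the overlap score is low but positive; repeated topical nouns across a whole text push the index up.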
While cohesion can be systematically analysed using observable grammatical markers or lexical repetition, coherence is less readily accessible from the discourse surface. Nevertheless, computational tools such as Coh-Metrix (McNamara et al., 2014) offer approximate indicators of coherence, including semantic similarity and situation models. Coherence emerges at the semantic level and can be computationally modelled using techniques such as Latent Semantic Analysis (LSA) (Landauer and Dumais, 1997), Word2Vec (Mikolov et al., 2013) or Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019). This study employs LSA as implemented in Coh-Metrix, since it is an integral and validated component of the tool on which our coherence measurements are based. LSA operates on the assumption that words acquire meaning from the contexts in which they occur (McNamara et al., 2014, p. 66), capturing deeper semantic relationships even when words do not appear in close proximity. For instance, in Example (5), the word school is situated within a semantic field that includes parents, educators and children. Similarly, in Example (6), the word battle co-occurs with other war-related terms such as arrow, spear, cavalry, knights and army, reflecting a coherent semantic cluster.
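To make the LSA mechanics concrete, the toy sketch below builds a latent semantic space and compares adjacent sentences, in the spirit of the adjacent-sentence similarity index. Note that Coh-Metrix derives its LSA space from a large reference corpus; here, purely for illustration, the space is fitted to the sample itself, and the sentences are adapted from Examples (5) and (6).

```python
# Toy illustration of LSA-based sentence similarity. The pipeline is
# TF-IDF -> truncated SVD (the LSA step) -> cosine similarity in the
# latent space; the corpus and dimensionality are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The Templars directed the arrow power that broke the spear schiltroms.",
    "Templar knights led the cavalry charge that destroyed the army.",
    "Parents and educators rarely promote a television series as a cultural event.",
    "Watching television is seen as less worthwhile than reading books.",
]

tfidf = TfidfVectorizer().fit_transform(sentences)
lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

sims = cosine_similarity(lsa)
# Similarities between adjacent sentence pairs, analogous in spirit to
# an adjacent-sentence LSA index
adjacent = [sims[i, i + 1] for i in range(len(sentences) - 1)]
print(adjacent)
```

Because the first two sentences share a military semantic field and the last two share a media/reading field, the latent space groups them even where surface words differ, which is exactly the property that lets LSA approximate coherence beyond explicit lexical overlap.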
Beyond the semantic dimension, readers also construct mental representations of the events, characters and settings described in a text. These representations are referred to as situation models (Dijk and Kintsch, 1983) or mental models (Johnson-Laird, 1989). Situation models are central to discourse comprehension, as they allow readers to mentally simulate the narrative world. Zwaan et al. (1995) identified three dimensions of situation models, namely temporality, spatiality and causality. McNamara et al. (2014) further distinguished between intentionality and causality, contending that intentionality refers to the actions of animate agents as part of plans in pursuit of goals, whereas causality refers to mechanisms that may or may not be driven by such goals. Coherence is maintained when any of these four dimensions remain continuous or logically connected; where discontinuity arises, additional cohesive devices become necessary to restore coherence. Example (6) demonstrates this through the use of causal verbs such as break and destroy, which establish a cause-and-effect relationship and support the narrative of triumph in warfare. These verbs reinforce coherence through causality and intentionality.
In summary, cohesion is an observable feature of discourse that can be measured through linguistic devices such as conjunctions, referential ties, and lexical overlap. In contrast, coherence is a higher-order construct reflecting the mental representation of meaning, which can only be approximated through measures like semantic similarity and situation modelling. Understanding both cohesion and coherence is essential for analysing how discourse is structured and understood by readers or listeners.
The concept of explicitation revisited
As discussed, the tendency for translated texts to exhibit increased cohesion compared to their non-translated counterparts has been widely described as explicitation. Blum-Kulka (1986) formally introduced this concept, arguing that translators often employ more cohesive devices in the target text than are strictly necessary for comprehension. However, the roots of the concept can be traced back to Vinay and Darbelnet (1958, p. 9) as a translation procedure that “consists in introducing in the target language details that remain implicit in the source language, but become clear through the relevant context or situation”.Footnote 1 Shuttleworth (1997, p. 55) described explicitation as a phenomenon in which target texts tend to express source text information more explicitly than the original, often through the addition of explanatory elements and enhanced communicative cues.
A significant development in explicitation studies was Klaudy’s (1998) typology, which classified explicitation into four types: obligatory, optional, pragmatic and translation-inherent. This categorisation is particularly important because it distinguishes explicitness that emerges in the translation process itself from that which arises from linguistic or cultural constraints. In her definitions, obligatory explicitation occurs because of structural differences between the source and target languages in grammar or semantics, while optional explicitation results from cross-linguistic differences in text-building strategies or stylistic preferences, realised through additional connectives or emphasisers (Klaudy, 1998). Pragmatic explicitation arises when translators clarify culture-specific items in anticipation of cultural differences, whereas translation-inherent explicitation is described as an inevitable consequence of all translational activity.
However, this typology has not gone unchallenged. Englund Dimitrova (2005) argued that the concept of translation-inherent explicitation is insufficiently clear and that pragmatic explicitation in fact belongs to optional explicitation. Becher (2010) likewise criticised the vagueness of the definition of explicitation and supported the asymmetry hypothesis proposed by Klaudy and Károly (2005, p. 14), which holds that translations tend to contain more explicitation than the corresponding implicitation. Indeed, Kruger (2019) found that the level of explicitness exceeds that of implicitness in English translated from Afrikaans, based on the frequency of the optional complementiser that.
Despite ongoing debates regarding its definition and classification, most scholars agree on the underlying premise that translated texts tend to contain more communicative cues to facilitate comprehension. This increased explicitness may manifest through denser cohesive ties or the explicit articulation of information that remains implicit in the source text. Consequently, explicitation has become a central concept in corpus-based translation studies, where it has been investigated using various linguistic indicators. For example, Konšalová (2007) examined morphosyntactic structures in Czech and German, discovering a strong tendency towards explicitation in both Czech and German translations. Jiménez-Crespo (2015) focused on verb use at two production stages of Spanish-English translation, finding that explicitation varies under different production conditions. Other indicators used to investigate explicitation include connectives (Jiang and Niu, 2022; Marco, 2018; Song, 2022; Xiao, 2010, 2015; Zufferey and Cartoni, 2014), personal pronouns (Xiao and Hu, 2015; Zhang et al., 2020), the optional complementiser that (De Sutter and Lefer, 2020; Kruger, 2019; Kruger and De Sutter, 2018; Olohan and Baker, 2000) and mean sentence length (Hu et al., 2019; Xiao and Hu, 2015).
Taken together, these studies underscore the robustness of explicitation as a translational phenomenon. They also highlight the importance of considering both linguistic and contextual factors in analysing the explicitness of translated texts.
The hypothesis of risk-aversion and algorithmic bias
In explaining the tendency towards explicitation in human translation, Pym (2005, 2015, 2020) formulated the risk-aversion hypothesis, suggesting that translators tend to make information explicit in order to reduce ambiguity and misunderstanding for the reader. Pym (2015, 2020) regards translation as a process of risk management in which translators face three types of risk: credibility risk, uncertainty risk and communicative risk. Credibility risks concern social relations and involve the danger of losing the trust of clients, end-users and other participants. Uncertainty risks arise in the cognitive process of making linguistic decisions about how to render the source text. Communicative risks, as the name suggests, relate to the danger of failing to achieve the intended communicative effects. These risks are correlated: linguistic uncertainties may lead to miscommunication and even to loss of trust from clients (Pym, 2020, p. 449).
This model resonates with recent developments in the cognitive approach to translation, particularly the 4EA (embodied, embedded, enacted, extended and affective) paradigm of cognition, also referred to as situated or social cognition (Halverson, 2015; Milošević and Risku, 2021; Muñoz Martín, 2016; Risku and Windhager, 2013; Robinson, 2020). The basic view of situated cognition is that human cognition is embedded, embodied, enacted, extended and affective; in other words, human cognitive processes interact with the environment, are mediated by the body, oriented towards action and supported by artefacts (Rowlands, 2010, pp. 51–84). From this perspective, translators’ preference for explicit, risk-reducing choices can be seen not as an isolated cognitive tendency, but as an adaptive strategy shaped by the broader institutional and communicative context of the translation industry.
The hypothesis of risk-aversion has been supported by several studies. For instance, Kruger (2019) examined several factors that may cause the omission of the complementiser that, finding that translators favour using explicit that in contexts which trigger low communicative risk, such as in fictional texts. Kruger and De Sutter (2018) also investigated situations in which that is either explicit or implicit, revealing that translators tend to avoid omitting that even in registers where omission is more conventional (e.g. creative writing and reportage), opting instead for the most frequent and formal option. Delaere and De Sutter (2013) provided further evidence by investigating the lexical choice between translated and non-translated Dutch, demonstrating that translators tend to favour safer, more mainstream linguistic options in their choices compared to original writers. They argued that a potential reason behind translators’ risk-averse behaviour is that they are more strongly influenced by their perception of the target audience. Overall, these studies reveal a consistent pattern: when faced with uncertainty, translators gravitate toward options that minimise risk, enhance clarity and conform to normative language use. If this interpretation holds, one would expect human-translated texts to display a higher frequency of cohesive devices than original texts, even in registers where lower cohesion is stylistically acceptable.
In parallel, tendencies shown in machine translated texts are believed to be linked to algorithmic bias, where the linguistic patterns in the training data, often drawn from human translations, are not merely replicated but amplified by statistical or neural models (De Clercq et al., 2021; Jiang and Niu, 2022; Luo and Li, 2022; Niu and Jiang, 2024; Vanmassenhove et al., 2019, 2021). From this perspective, machine translation may inherit and intensify the explicitation patterns found in its human-produced training corpora. Consequently, it is reasonable to hypothesise that machine translation output might also exhibit strong tendencies toward explicitation, potentially even exceeding those found in human translations under certain conditions.
Research methodology
This section provides an exposition of the data and methodologies employed in the present study. The structure of the corpora is elaborated upon in section “Corpora design”, followed by an introduction to the measurement indices in section “Measurement”. Considering the research objectives and data distribution, section “Data analysis” elucidates on three non-parametric multivariate analysis techniques.
Corpora design
In order to make a comparison between translated and original English, two balanced corpora, the Corpus of Chinese into English (COCE)Footnote 2 (Li and Yang, 2017; Liu and Afzaal, 2021) and the Freiburg-LOB Corpus of British English (FLOB) (Hundt et al., 1999), were used in the present study. Furthermore, for the specific purpose of contrasting and investigating machine translationese, two neural machine translation corpora were generated by rendering the texts in the Lancaster Corpus of Mandarin Chinese (LCMC) (McEnery and Xiao, 2004) through two popular neural machine translation tools, namely DeepLFootnote 3 and Google TranslateFootnote 4. We name them the Lancaster Corpus of Mandarin Chinese Translated into English by DeepL (LCMCTD) and the Lancaster Corpus of Mandarin Chinese Translated into English by Google Translate (LCMCTG), respectively. In this way, the four corpora are comparable in terms of the number of texts, registers and text types, and the three translation corpora share the same source language (see Table 1).
Measurement
The present study takes a quantitative approach, using linguistic features extracted from Coh-Metrix 3.0 (McNamara et al., 2014)Footnote 5. Coh-Metrix was developed by McNamara et al. (2014) as a tool for measuring cohesion and coherence in discourse. The latest version provides 108 indices for a text, covering descriptive statistics, text easability, referential cohesion, LSA, connectives, the situation model, syntactic pattern density, word information and readability, and these metrics can be applied to nearly any type of text or genre (Graesser and McNamara, 2011).
Given that the primary aim of this study is to examine how translation variety and register influence patterns of cohesion and coherence, our analysis focuses on a subset of Coh-Metrix indices most directly related to these two constructs. Specifically, we include features grouped into five core components: referential cohesion, personal pronouns, connectives, LSA and the situation model. These are summarised in Table 2. More information about the variables can be accessed in McNamara et al. (2014, pp. 247–251).
As outlined in section “Cohesion and coherence”, we classify referential cohesion, personal pronouns and connectives as indicators of cohesion, while LSA and the situation model are treated as proxies for coherence. Referential cohesion reflects the degree of overlap among nouns, pronouns and other content words that create continuity across sentences. Coh-Metrix calculates five types of coreference: noun overlap (CRFNO1 and CRFNOa), argument overlap (CRFAO1 and CRFAOa), stem overlap (CRFSO1 and CRFSOa), content word overlap (CRFWO1 and CRFWOa) and anaphor overlap (CRFANO1 and CRFANOa). The personal pronoun category includes normalised frequencies (per 1000 words) of first-person singular (WRDPRP1s), first-person plural (WRDPRP1p), second-person (WRDPRP2), third-person singular (WRDPRP3s) and third-person plural (WRDPRP3p) pronouns, which reflect inter-personal cohesion in discourse. Connectives, as cohesive ties between clauses or sentences, are quantified by type (per 1000 words), including causal (CNCCaus), logical (CNCLogic), adversative/contrastive (CNCADC), temporal (CNCTempx), additive (CNCAdd), positive (CNCPos) and negative (CNCNeg). Semantic similarity, measured via LSA, focuses on “semantic overlap between explicit words and words that are implicitly similar or related in meaning” (McNamara et al., 2014, p. 66). Coh-Metrix provides measurement of this component at the level of adjacent sentences (LSASS1) and paragraphs (LSASSp). A unique measure of how much new versus old information exists (LSAGN) is also provided. According to McNamara et al. (2014, p. 66), sentence content is partitioned as given, partially given (based on various types of inferential availability) or new, and LSAGN serves as a proxy for how much given versus new information exists in each sentence in a text, compared with the content of prior text information.
Finally, the situation model represents deeper, coherence-related cognitive structures that track causality, intentionality and temporal continuity. Relevant indices include: causal verbs (SMCAUSv), causal verbs and particles (SMCAUSvp), intentional verbs (SMINTEp), ratio of causal particles to verbs (SMCAUSr), ratio of intentional particles to verbs (SMINTEr), LSA verbs overlap (SMCAUSlsa), WordNet verbs overlap (SMCAUSwn) and tense and aspect repetition (SMTEMP).
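Most of the pronoun and connective indices above are incidence scores, i.e. frequencies normalised per 1000 running words. A minimal sketch of this normalisation follows; the connective list here is illustrative, not the Coh-Metrix lexicon.

```python
# Minimal sketch of incidence scoring per 1000 words, the normalisation used
# for the pronoun and connective indices (e.g. WRDPRP1s, CNCCaus). The word
# list below is illustrative only, not the Coh-Metrix lexicon.
CAUSAL_CONNECTIVES = {"because", "so", "therefore", "since"}

def incidence_per_1000(words, target_set):
    """Frequency of target items normalised per 1000 running words."""
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.lower() in target_set)
    return 1000 * hits / len(words)

words = "I was prepared to do this song because it is one that I like".split()
print(incidence_per_1000(words, CAUSAL_CONNECTIVES))  # 1 hit in 14 words
```

Normalising in this way makes counts comparable across texts of different lengths, which is essential when corpora mix short news reports with long academic prose.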
Descriptive statistics of these indices across registers and translation varieties are presented in Tables 3 and 4, respectively. Because the Coh-Metrix indices differ in scale and units, we standardised all variables into z-scores prior to conducting multivariate analyses. This step ensured comparability across metrics and allowed for the identification of relative patterns in cohesion and coherence features across text types.
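The standardisation step is straightforward; the following minimal Python sketch illustrates it (the study itself used R, and the index values below are hypothetical, not drawn from the corpus):

```python
import numpy as np

# Hypothetical values for three Coh-Metrix indices on different scales
# (e.g., an overlap proportion, a per-1000-word frequency, an LSA score).
X = np.array([
    [0.42, 85.0, 0.18],
    [0.31, 60.0, 0.22],
    [0.55, 95.0, 0.15],
    [0.28, 40.0, 0.30],
])

# z-score each column: subtract its mean and divide by its standard
# deviation, so every index has mean 0 and unit variance before PCA/FDA.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

print(Z.mean(axis=0))  # columns centred at 0 (up to floating-point error)
print(Z.std(axis=0))   # columns scaled to unit variance
```

After this transformation, every metric contributes on a comparable scale, so no single index dominates the multivariate analyses merely because of its units.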
Data analysis
The present study triangulates exploratory and confirmatory analyses by combining PCA, an unsupervised machine learning technique, with FDA, a supervised machine learning technique. Triangulation is well established in corpus-based translation studies and has been adopted in numerous earlier studies. For example, Evert and Neumann (2017) employed PCA and linear discriminant analysis (LDA) to investigate the shining-through effect (Teich, 2003) in German-English translations. De Sutter et al. (2012) and Delaere and De Sutter (2017) combined profile-based correspondence analysis with logistic regression models to investigate the influence of source languages and registers on onomasiological variants in Dutch translations. These studies serve as important methodological precedents and reinforce the validity of the combined approach adopted here.
In the current study, PCA was first employed to explore the underlying structure of the dataset without imposing any assumptions related to the grouping variables. As an unsupervised technique, PCA enables the visualisation of potential clustering among text samples—potentially reflecting register differences or translation varieties—based solely on their linguistic features. To evaluate whether the observed group separations were statistically significant, we subsequently conducted a PERMANOVA (Anderson, 2017), which provides a robust, non-parametric assessment of group differences based on distance matrices.
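PERMANOVA compares an observed pseudo-F statistic, computed from a distance matrix, against its distribution under random permutations of group labels. The following minimal Python sketch with toy data (the study itself used R; the two "registers" and their feature values here are invented for illustration) shows the core computation:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)

def pseudo_f(d2, groups):
    """Pseudo-F from a squared Euclidean distance matrix (Anderson, 2017)."""
    n = len(groups)
    labels = np.unique(groups)
    ss_total = d2[np.triu_indices(n, 1)].sum() / n
    ss_within = 0.0
    for g in labels:
        idx = np.where(groups == g)[0]
        sub = d2[np.ix_(idx, idx)]
        ss_within += sub[np.triu_indices(len(idx), 1)].sum() / len(idx)
    ss_between = ss_total - ss_within
    df_b, df_w = len(labels) - 1, n - len(labels)
    return (ss_between / df_b) / (ss_within / df_w)

def permanova(X, groups, n_perm=999):
    """Permutation p-value: how often shuffled labels match the observed F."""
    d2 = squareform(pdist(X, "euclidean")) ** 2
    f_obs = pseudo_f(d2, groups)
    count = sum(pseudo_f(d2, rng.permutation(groups)) >= f_obs
                for _ in range(n_perm))
    return f_obs, (count + 1) / (n_perm + 1)

# Toy data: two well-separated "registers" in z-scored feature space
X = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(3, 1, (20, 5))])
groups = np.array(["academic"] * 20 + ["fiction"] * 20)
f, p = permanova(X, groups)
print(f, p)
```

Because the test is built on permutations of a distance matrix rather than on parametric distributional assumptions, it remains valid for the non-normal feature distributions typical of corpus data.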
While PCA offers an exploratory overview, FDA was subsequently implemented to confirm and quantify the discriminative power of linguistic features in classifying texts according to register and translation variety. This method was chosen over LDA due to the violation of multivariate normality, as indicated by the Henze-Zirkler test (HZ = 1.01, p < 0.001). Unlike LDA, which assumes a multivariate normal distribution, FDA applies non-parametric regression techniques, making it more suitable for the distributional properties of our dataset (Hastie et al., 1994; Mallet et al., 1996). Importantly, FDA not only classifies observations into predefined groups but also identifies the relative contribution of each predictor (i.e., linguistic feature) to the classification process. This enables the analysis to move beyond general group differences and pinpoint which aspects of cohesion and coherence most effectively differentiate registers and translation varieties. Through this, we are able to uncover meaningful patterns and provide interpretive depth to the observed variation. All analyses were performed in R version 4.2 (R Core Team, 2023).
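The key idea behind FDA is to replace LDA's purely linear fit with a more flexible regression basis. The following Python sketch with simulated data (an illustration only, not the study's model: the features, labels and degree-2 polynomial basis are invented for the example) shows how a richer basis lets the discriminant capture structure that plain LDA misses:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)

# Toy stand-in for z-scored linguistic features of two text groups whose
# class boundary is nonlinear, so a plain linear discriminant struggles.
X = rng.normal(0.0, 1.0, (200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.5).astype(int)  # radial class boundary

# FDA in the spirit of Hastie et al. (1994): run the discriminant on a
# richer regression basis -- here, a degree-2 polynomial expansion.
fda_like = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearDiscriminantAnalysis(),
).fit(X, y)

plain_lda = LinearDiscriminantAnalysis().fit(X, y)

# The expanded basis can represent the curved boundary; plain LDA cannot.
print(fda_like.score(X, y), plain_lda.score(X, y))
```

In the same spirit, FDA's discriminant weights on the expanded basis can be inspected to gauge each predictor's contribution, which is how the most informative cohesion and coherence features are identified in the analyses below.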
Results and discussion
This section presents the research findings and discusses the results. In the section “General results”, a general analysis based on PCA and PERMANOVA identifies which factor exerts the greater influence. Subsequently, confirmatory analyses of register variation, translation variety and their interactions are reported in sections “Register divergence”, “Translation variety” and “Interactions of register divergence and translation variety”, respectively. Section “Discussion” expounds upon these findings, elucidating how they contribute to addressing the three research questions.
General results
This section presents the overall results of the PCA, followed by cross-validation using PERMANOVA to assess the statistical significance of observed differences. Figure 1 displays scatterplots for the three dimensions generated by PCA, with ellipses representing 95% confidence intervals and density curves plotted along each axis to visualise distributional tendencies. As PCA is an unsupervised technique that does not incorporate categorical variables, the dimensions extracted may reflect variation attributable to registers, translation varieties or other latent factors. To facilitate interpretation, we overlay labels for both register and translation variety onto the PCA plots. Figure 1a presents the biplot of the first (x-axis) and second (y-axis) dimensions, which account for 34.9% and 12.2% of the total variance, respectively. The horizontal axis (dimension 1) appears to primarily reflect variation in registers. Specifically, academic and fictional texts are situated at opposite ends of the axis: academic texts toward the left and fictional texts toward the right, while general and journalistic texts occupy the middle positions. This suggests that dimension 1 captures a continuum of register-based discourse variation. By contrast, dimension 2 (as shown on the vertical axis of Fig. 1a and the horizontal axis of Fig. 1b) does not reveal a clear pattern associated with register. However, dimension 3 (on the vertical axis of Fig. 1b and horizontal axis of Fig. 1c), representing 9.3%, appears to differentiate literary from non-literary texts. Fictional texts cluster in the positive coordinate space, whereas journalistic, general and academic texts are predominantly located in the negative range.
Turning to translation variety, Fig. 1d (dimensions 1 and 2) does not show a strong separation between translated and non-translated texts. However, clearer distinctions emerge in Fig. 1e, f, where dimensions 2 and 3 are plotted along the horizontal axis. Along dimension 2, original English texts are concentrated between 0 and +5, while translated texts, both human and machine (DeepL and Google), are mainly positioned between −5 and 0. In Fig. 1f, the trend reverses along dimension 3, further suggesting that dimensions 2 and 3 jointly capture variation attributable to translation variety. Notably, within the translated group, there is substantial overlap between human translation and machine translations from DeepL and Google, while all translated varieties are slightly separated from non-translated texts. This suggests that dimension 3 may reflect a convergence of register and translation-based variation. Although translation variety does influence cohesion and coherence patterns, it does so to a lesser extent than register, based on the relatively larger variance of dimension 1 compared to dimensions 2 and 3.
This interpretation is reinforced by the results of the PERMANOVA analysis, summarised in Table 5. Euclidean distance was used to measure dissimilarity among samples, and the Bonferroni method was applied to adjust p-values. We tested the main effects of register and translation variety, as well as their interaction. All effects were found to be statistically significant. However, their relative contributions to variance differ markedly: register explains 18% of the variance, translation variety accounts for 7% and their interaction contributes only 2%. These findings align with the PCA results and substantiate the conclusion that register variation is the dominant factor shaping cohesive and coherent patterns in the texts analysed. In contrast, translation variety, though significant, exerts a comparatively smaller effect. The minimal interaction effect further suggests that register and translation variety influence discourse features in largely additive rather than synergistic ways.
Register divergence
To investigate how cohesion and coherence vary across registers and to uncover the characteristics of each register, we applied FDA and a post-hoc PERMANOVA test. This dual approach serves to both confirm whether meaningful variation exists across registers and to identify the specific linguistic features that contribute to such variation. Key discriminating variables were identified by examining their weights in the most informative FDA dimensions. In addition, Kruskal–Wallis tests and Dunn’s post-hoc comparisons were used to determine how these variables differ across the four registers.
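The Kruskal–Wallis and Dunn procedure can be sketched as follows in Python (toy register data invented for illustration; the study's analyses were run in R, and a real analysis would also adjust the pairwise p-values, e.g. with a Bonferroni correction):

```python
import numpy as np
from scipy.stats import kruskal, norm

rng = np.random.default_rng(2)

# Hypothetical cohesion scores for the four registers (toy values only;
# the real study uses Coh-Metrix indices such as CRFSO1).
samples = {
    "academic":     rng.normal(0.6, 0.1, 30),
    "fiction":      rng.normal(0.2, 0.1, 30),
    "journalistic": rng.normal(0.4, 0.1, 30),
    "general":      rng.normal(0.4, 0.1, 30),
}

# Omnibus Kruskal-Wallis test across the four groups
h_stat, p_omnibus = kruskal(*samples.values())

def dunn_z(a, b):
    """Dunn's post-hoc z-statistic for one pair, using joint ranks."""
    pooled = np.concatenate(list(samples.values()))
    ranks = pooled.argsort().argsort() + 1.0  # 1-based ranks (no ties here)
    n = len(pooled)
    start, mean_rank = 0, {}
    for name, vals in samples.items():
        mean_rank[name] = ranks[start:start + len(vals)].mean()
        start += len(vals)
    se = np.sqrt(n * (n + 1) / 12.0
                 * (1 / len(samples[a]) + 1 / len(samples[b])))
    return (mean_rank[a] - mean_rank[b]) / se

z = dunn_z("academic", "fiction")
p_pair = 2 * norm.sf(abs(z))  # two-sided pairwise p-value
print(h_stat, p_omnibus, p_pair)
```

The omnibus test establishes that at least one register differs; the pairwise z-statistics then locate which register pairs drive the difference, mirroring the comparisons reported in Fig. 4.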
Figure 2 presents the biplots of the three discriminants extracted through FDA. The model’s classification performance is acceptable, with an accuracy rate of 72% for the training data (70% of the total dataset) and 70% for the testing data (30%). This performance suggests that registers exhibit distinctive cohesive and coherent patterns that are reliably separable by statistical modelling.
In Fig. 2a, the x-axis represents discriminant 1 (66.48% of the variance), and the y-axis represents discriminant 2 (23.39%). The density curves along each axis further illustrate distributional patterns. The four registers are clearly separated in this plot, closely mirroring the PCA results. Specifically, fictional texts cluster between −2 and 0 on the horizontal axis, academic texts are situated on the far right, and journalistic and general texts fall between them. Figure 2b shows that discriminant 2 distinguishes general texts from the others, with journalistic texts occupying a middle position. In Fig. 2c, discriminant 3 (10.14%) appears to differentiate news texts from the remaining registers. Overall, the dominant role of discriminant 1, which captures over half of the total variance, highlights the central importance of register in shaping cohesive and coherent patterns.
These differences are statistically validated by pairwise PERMANOVA results in Table 6, which confirm significant variation across all four registers. The F-values reflect the magnitude of dissimilarity between register pairs, with higher values indicating more substantial differences. For example, academic texts differ most strongly from fictional texts (F = 382.35, df = 1, p < 0.001). The difference between journalistic and fictional texts is also pronounced (F = 194.11, df = 1, p < 0.001). Even between journalistic and general texts, where visual overlap is observed in the central PCA and FDA plots, a significant difference is present (F = 35.63, df = 1, p < 0.001), confirming the presence of nuanced but meaningful register-specific variation.
To identify the linguistic features responsible for these differences, we examined the variable weights in the first FDA dimension. Figure 3 displays the five most important contributors: lemmas overlap between two adjacent sentences (CRFSO1), third-person plural pronouns (WRDPRP3s), logical connectives (CNCLogic), the ratio of given to new information measured by LSA (LSAGN) and the ratio of intentional particles to intentional verbs (SMINTEr). Among these, features related to coreference and personal pronouns appear to play a particularly central role in register differentiation, suggesting that referential cohesion is a key driver of register variation.
To explore how these variables behave across registers, we conducted Kruskal–Wallis tests, followed by Dunn’s post-hoc comparisons. The results, presented in Fig. 4, indicate significant register-based differences in all five dimensions. The χ2 values represent the test statistic used to compare distributions across groups, and higher χ2 values indicate greater divergence between groups. In Fig. 4a, academic texts show the highest levels of cohesion measured by lemmas overlap, significantly exceeding all other registers (χ2 = 871.61, df = 3, p < 0.001), indicating a stronger reliance on lexical repetition for cohesion. Interestingly, Fig. 4b reveals that fictional texts are characterised by a significantly higher frequency of third-person plural pronouns, whereas academic texts show the lowest usage (χ2 = 847.37, df = 3, p < 0.001), consistent with narrative storytelling versus impersonal academic discourse (Biber, 1988). In Fig. 4c, the use of logical connectives is again most prominent in academic texts (χ2 = 72.46, df = 3, p < 0.001), aligning with their expository function and argumentative structure. Figure 4d shows that academic texts also demonstrate the highest semantic similarity (LSAGN) among these four registers, while fictional texts display the lowest levels (χ2 = 422.15, df = 3, p < 0.001), reflecting their preference for creative linguistic variation over semantic repetition. With regard to intentionality markers (Fig. 4e), academic texts also lead, showing the highest intentional coherence (χ2 = 631.07, df = 3, p < 0.001), reinforcing their deliberate rhetorical organisation.
a–e Display the associations between register variation and five representative variables: CRFSO1, WRDPRP3s, CNCLogic, LSAGN and SMINTEr, respectively. Asterisks indicate statistically significant differences between pairs, while “ns” denotes non-significant differences (*p < 0.05, **p < 0.01, ***p < 0.001).
In summary, academic texts are marked by high levels of referential cohesion, semantic similarity, logicality and intentionality, reflecting their formal and structured communicative purpose. In contrast, fictional texts tend to ensure narrative fluidity and character interaction through the use of third-person plural pronouns. Journalistic and general texts display more balanced cohesion and coherence profiles. In some features, such as the overlap of lemmas between sentences and third-person plural pronouns, their patterns are statistically indistinct, indicating a degree of stylistic convergence between these two registers.
Translation variety
Following the same procedure outlined in section “Register divergence”, another FDA was conducted to examine the differences among translation varieties. The model yielded an accuracy rate of 77% on the training dataset and 75% on the testing dataset, indicating a generally reliable classification performance. Figure 5 presents the biplots of the three discriminant dimensions generated by the second FDA model. In Fig. 5a, the x-axis represents the first discriminant dimension (56.04% of the variance), while the y-axis corresponds to the second (36.06%). Human-translated texts primarily fall within the positive range of the x-axis, whereas the other varieties cluster on the negative side, suggesting a clear distinction between human translations and the other three varieties along this dimension. In Fig. 5b, the horizontal axis (dimension 2) differentiates Google translations from the remaining varieties, with non-translations and DeepL translations occupying a central position. Figure 5c plots the third discriminant (7.9%) along the x-axis and the first dimension along the y-axis. While dimension 3 captures a relatively smaller portion of the variance, it visibly separates translated from non-translated texts based on the density curves, echoing the findings from the PERMANOVA and PCA analyses. Specifically, non-translated texts tend to cluster between 0 and +2, while translated texts span from −2 to 1 on the third dimension. However, it should be noted that despite the overall high accuracy of the model, the dimension capturing translations versus non-translations does not account for a very large proportion of variance. This is because the FDA model captures multidimensional distinctions among the four translation varieties.
To statistically validate these observations, a pairwise PERMANOVA was conducted. The results, presented in Table 7, indicate significant differences among the four translation varieties. Much larger disparities are observed in comparisons involving non-translated texts: Original English vs. DeepL Translation (F = 73.95, df = 1, p < 0.001), Original English vs. Google Translation (F = 72.32, df = 1, p < 0.001) and Human Translation vs. Original English (F = 64.97, df = 1, p < 0.01). Interestingly, although the two machine-translated varieties, DeepL and Google translations, also differ significantly from each other (F = 9.12, df = 1, p < 0.001), the magnitude of difference is relatively small since they share the same source text. These results suggest that machine translations exhibit a distinct cohesive and coherent profile, which is neither fully aligned with human translations nor with original texts.
To further interpret these differences, we examined the contribution of individual variables to the third discriminant dimension (Fig. 6). Although discriminant 3 explains only 7.9% of the variance, we chose this dimension for further analysis because the primary goal of this study is not to maximise classification accuracy between translation varieties, but rather to investigate general tendencies of cohesion and coherence in translated texts. This approach follows a precedent set by Evert and Neumann (2017), who highlighted the value of interpreting lower-variance dimensions when they reveal linguistically meaningful patterns. Nonetheless, we acknowledge this limitation in the discussion and call for cautious interpretation.
Variables that contribute most to distinguishing translated from non-translated texts are primarily related to referential cohesion and semantic similarity (as measured by LSA), with lesser contributions from the situation model, connectives and personal pronouns. As in the register analysis, we focus on the most influential variable in each component for interpretive analysis. These include: the average number of shared lemmas between sentences (CRFSOa) for referential cohesion, first-person plural pronouns (WRDPR1s) for personal pronouns, the frequency of negative connectives (CNCNeg) for connectives, semantic similarity between adjacent sentences (LSASS1) for LSA and the frequency of causal verbs indicating changes of state (SMCAUSv) for the situation model.
Figure 7 presents the results of Kruskal–Wallis tests on these five key variables, followed by Dunn’s test for pairwise comparisons. Overall, translated and non-translated texts exhibit systematically different patterns of cohesion and coherence. Specifically, Fig. 7a shows significant differences in the score of lemmas overlap between sentences in an entire text across the four varieties (χ2 = 55.79, df = 3, p < 0.001). Human-translated texts exhibit more referential cohesion than non-translated texts, but both types of machine translation demonstrate even higher levels of stem overlap. In contrast, Fig. 7b shows no statistically significant differences in the use of first-person plural pronouns across the four varieties (χ2 = 7.27, df = 3, p = 0.064), indicating that this particular aspect of cohesion may be less sensitive to translation variety. Figure 7c shows that negative connectives are more frequently used in original texts than in translations (χ2 = 214.38, df = 3, p < 0.001), suggesting that non-translations may present more contrastive or argumentative discourse relations. Similarly, Fig. 7d highlights a significant difference in semantic similarity (χ2 = 62.01, df = 3, p < 0.001), with both human and machine translations showing greater local coherence than original texts. For verb-based cohesion (Fig. 7e), human translations employ significantly more causal verbs than both non-translations and machine translations (χ2 = 172.83, df = 3, p < 0.001), reinforcing their tendency toward greater explicit causality. However, this finding contrasts with Jiang and Niu (2022), who reported that human translations tend to show higher semantic similarity (LSASS1) but lower usage of causal verbs (SMCAUSv) than original texts.
a–e Display the associations between translation variety and five representative variables: CRFSOa, WRDPR1s, CNCNeg, LSASS1 and SMCAUSv, respectively. Asterisks indicate statistically significant differences between pairs, while “ns” denotes non-significant differences (*p < 0.05, **p < 0.01, ***p < 0.001).
In summary, translated texts by both humans and machines tend to exhibit enhanced cohesion relative to non-translated texts. However, the increased explicitness is not uniform across all cohesive and coherent metrics, and machine translations tend to overrepresent certain cohesive features. These results underscore the complex, multidimensional nature of translation-induced variation in textual cohesion and coherence.
Interactions of register divergence and translation variety
Sections “Register divergence” and “Translation variety” examined the main effects of register and translation variety. However, it remains unclear whether the linguistic characteristics exhibited by translated texts are consistently distributed across different registers. This section aims to address this gap by investigating whether the phenomenon of explicitation can be observed consistently across the four specific registers in both human and machine translations. To this end, a register-specific analysis was conducted using Kruskal–Wallis tests followed by Dunn’s post hoc comparisons. The results are summarised in Table 8. In the table, the symbols ‘>’ and ‘<’ denote whether the mean rank of the first translation variety is higher or lower than that of the second. An asterisk ‘*’ indicates a statistically significant difference, while ‘ns’ refers to non-significant results.
A closer analysis of journalistic texts reveals that, for three out of five variables, translated texts exhibit higher values than non-translated texts, indicating a tendency toward explicitation. However, negative connectives continue to be used significantly more frequently in original texts. In contrast, the pattern in general texts is somewhat more nuanced. Human and machine translations tend to employ first-person plural pronouns, causal verbs and semantic overlaps more frequently than non-translations, though not all of these differences are statistically significant. Interestingly, while overlaps of lemmas between sentences occur more often in non-translated texts than in human translations, machine translations do not differ significantly from originals in this regard, suggesting a closer alignment with native patterns in this dimension. A different pattern is also observed in academic texts. Human translations display explicitation primarily through increased use of causal verbs. Machine translations, by contrast, show a greater tendency toward increased cohesion and coherence in terms of semantic similarity and lemma overlaps, though the latter is not statistically significant. In fictional texts, a clearer tendency toward explicitation is observed in both referential cohesion and LSA measures. However, causal verbs, which often serve to clarify logical connections, are not consistently more frequent in human translations in this register, indicating that explicitation may operate selectively depending on discourse function and narrative style. Conversely, negative connectives remain consistently more frequent in non-translated texts across all registers, challenging the idea that original texts tend to favour implicit cohesion strategies. These results highlight the genre-specific effect on the tendency towards explicitation.
In summary, while explicitation is a prominent feature of translated texts, it is not uniformly distributed across registers. Instead, its presence appears to be register-sensitive and context-dependent. This conditional distribution suggests that explicitation should not be regarded as a universally consistent feature of translation, but rather as one that interacts dynamically with genre conventions and translation varieties.
Discussion
The aforementioned analysis suggests that the observed discrepancies in cohesion and coherence are predominantly attributable to register variation, while translation variety and its interaction with register exert a comparatively minor influence. Specifically, substantial differences in cohesion and coherence are found across the four registers examined, with particularly marked contrasts between academic and fictional texts. Furthermore, both human and machine translations exhibit a general tendency toward explicitation, with machine translations displaying an even stronger inclination toward increased cohesion and coherence on certain metrics. However, this tendency is not uniformly evident across all registers or variables, but is instead context-sensitive. The following discussion interprets these findings in relation to the study’s three research questions.
The first research question addressed how translated texts, both human and machine, differ from non-translated texts in terms of cohesion and coherence. The analysis reveals a clear overall tendency toward explicitation in both translation varieties, supporting prior findings and theoretical expectations. In particular, machine-translated texts show significantly higher levels of referential cohesion and semantic similarity, which may reflect the influence of algorithmic bias—that is, the tendency of machine learning models to favour more frequent or prototypical linguistic choices from their training data (De Clercq et al., 2021; Jiang and Niu, 2022; Luo and Li, 2022; Vanmassenhove et al., 2019, 2021). This tendency can be understood as a form of amplification of translational features, where machine translation exaggerates existing norms or tendencies found in human translation. However, explicitation in machine translation cannot be solely attributed to algorithmic bias. Other factors, such as training data, model architecture and algorithmic design choices, also play an important role (Jiang and Niu, 2022). For instance, Google Translate is typically based on recurrent neural networks (RNNs) and trained on broad digital resources, while DeepL relies on convolutional neural networks (CNNs) and draws from the Linguee bilingual corpus (Mouratidis et al., 2021; Ziganshina et al., 2021). Although both systems show explicitation tendencies, intra-system differences are also apparent: for example, Google Translate demonstrates a higher frequency of verb cohesion to express causality, whereas DeepL exhibits a more conservative approach in that regard.
The second research question asked whether these translational characteristics are consistently observable across different registers. While explicitation is not uniformly present across all variables or registers, a general trend toward increased cohesion and coherence is still discernible in both human and machine-translated texts. This trend, particularly in human translation, can be partly interpreted through a risk-averse lens. According to Pym (2015, 2020), human translators often adopt risk-reduction strategies, increasing textual cohesion to enhance comprehension and provide clearer communicative cues for the reader. This strategic choice is especially relevant when translators operate under pressure to produce accurate and culturally acceptable outputs. Even in fiction, a register typically characterised by lower cohesion, translated texts often exhibit greater cohesion and coherence scores than their original English counterparts. Nevertheless, there are also exceptions in which human translators engage in risk-taking behaviour, such as the reduced overlap of lemmas between sentences in academic and general texts. As Pym (2015, 2020) suggests, the risks translators seek to mitigate are not confined to the text-translator relationship, but also include potential consequences involving readers, clients and other institutional stakeholders. Hu (2020), Sela-Sheffy (2005) and Robinson (2020) describe how translators validate and internalise norms through the long-term interplay of their experience and the environment in which they are embedded, showing that translators may face sanctions, negative outcomes or penalties if they depart from established norms. Hence, as experience becomes entrenched, risk-taking is generally avoided and risk-averse options are favoured.
Nonetheless, the presence of implicitness in certain components of the translated texts also likely reflects source-language interference, or the shining-through effect (Teich, 2003), particularly given the tendency for Chinese source texts to exhibit implicit cohesion. This finding also aligns with Xiao and Hu (2015) and Zhang et al. (2020).
The third research question examined whether register or translation variety has a greater impact on patterns of cohesion and coherence. The triangulated results from PCA, PERMANOVA and FDA confirm that register exerts a more substantial influence than translation variety. This finding aligns with prior studies that emphasise the dominant role of register in shaping textual features, including those in translated texts (Diwersy et al., 2014; Kruger and Rooy, 2018; Neumann, 2014). Additionally, the interaction between translation variety and register underscores the contextual dependence of translational tendencies. As Delaere and De Sutter (2017, p. 106) argue, the general characteristics associated with translated language should not be treated as universal or homogeneous; rather, they are modulated by register-specific constraints and communicative goals. Therefore, any investigation into the linguistic features of translated texts must account for the interplay between translation and register, rather than isolating translation effects in a vacuum.
In summary, this study demonstrates that while both human and machine translations exhibit a tendency toward explicitation, this is neither uniform across all linguistic variables nor independent of contextual factors. Register emerges as a more powerful contributor to variation in cohesion and coherence than translation variety, and translational choices are shaped by a complex constellation of cognitive, social and technological influences. These findings call for a context-sensitive and multifactorial approach to the study of translated language, particularly as machine translation continues to evolve and interact with human translation practices.
Conclusion
The current study investigated the cohesive and coherent features of translated texts, incorporating the factors of translation variety and register divergence. The data were extracted from five components of Coh-Metrix, including referential cohesion, personal pronouns, connectives, LSA and the situation model, and the results were based on the triangulation of exploratory and confirmatory multivariate techniques. Several interesting findings were reported in this work: first, translated texts display a tendency towards explicitation at the general level, and human and machine-translated texts are characterised by distinct patterns in cohesion and coherence; second, these patterns are not consistently distributed across registers and are largely context-dependent; third, among the influencing factors, register variation exerts a greater impact on cohesion and coherence than translation variety.
This work is of significance due to its theoretical, methodological and practical implications. Theoretically, it underscores the importance of examining universal tendencies in translation, particularly explicitation, within their broader technological and functional contexts. Methodologically, the study adopts a triangulated analytical approach, combining PCA, PERMANOVA and FDA, to provide a robust and multidimensional picture of the cohesion and coherence features in translated versus non-translated texts. Practically, the findings call for a more nuanced consideration of machine translation in translation practices, especially regarding how its output diverges from that of human translators in terms of textual cohesion and coherence. This could inform future strategies for improving neural machine translation systems, post-editing processes and translator training.
Nevertheless, there are still several issues worthy of further exploration in the future. The current study focused on English translations from Chinese, where a degree of implicitation in connectives was observed, likely due to the shining-through effect, as Chinese tends to favour implicit cohesion. Future studies should investigate other language pairs to assess how source language typology influences cohesive patterns in translation. Moreover, additional factors such as translator experience, stage of translation (draft vs. final version), intended target audience, directionality (L1–L2 vs. L2–L1 translation) and language status (e.g., dominant vs. minority languages) could provide a more comprehensive understanding of the variation in translated texts. Finally, our analysis also suggests instances of risk-taking behaviour among translators in certain registers, despite the general tendency toward risk aversion. This highlights the importance of further investigating the socio-cognitive and contextual factors that influence when and why translators might adopt risk-taking strategies, moving beyond norms and constraints to consider agency, motivation and communicative intent.
Data availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Change history
12 November 2025
In the original version of this article, the affiliation details for Author Jia Li and Yuan Gao were incorrectly given as 'Southwest University, Chongqing, China' and 'Shenzhen University, Shenzhen, China' but should have been 'College of International Studies, Southwest University, Chongqing, China' and 'School of Media and Communication, Shenzhen University, Shenzhen, China'. The article has now been corrected.
Notes
Original: ‘procédé qui consiste à introduire dans LA des précisions qui restent implicites dans LD, mais qui se dégagent du contexte ou de la situation’. The English translation was verified by Brian Nelson, personal communication, 16 March 2016.
This corpus is compiled by Dr Richard Xiao and Dr Andrew Hardie at CASS in collaboration with Dr Dechao Li and Professor Chu-Ren Huang of the Hong Kong Polytechnic University, supported by the joint ESRC (UK)–RGC (Hong Kong) research project “Comparable and Parallel Corpus Approaches to the Third Code: English and Chinese Perspectives” (ES/K010107/1).
References
Anderson MJ (2017) Permutational multivariate analysis of variance (PERMANOVA). In: Kenett RS et al. (eds) Wiley StatsRef: statistics reference online (1st edn., pp. 1–15). Wiley. https://doi.org/10.1002/9781118445112.stat07841
Baker M (1993) Corpus linguistics and translation studies—implications and applications. In: Baker M, Francis G, Tognini-Bonelli E (eds.) Text and technology. John Benjamins Publishing Company. pp. 233–250. https://doi.org/10.1075/z.64.15bak
Baker M (1996) Corpus-based translation studies: the challenges that lie ahead. In: Somers H (ed) Terminology, LSP and translation: studies in language engineering in honour of Juan C. Sager. John Benjamins Publishing Company. pp. 175–186. https://doi.org/10.1075/btl.18.17bak
Becher V (2010) Abandoning the notion of “translation-inherent” explicitation: against a dogma of translation studies. Across Lang Cult 11(1):1–28. https://doi.org/10.1556/Acr.11.2010.1.1
Biber D (1988) Variation across speech and writing. Cambridge University Press. https://doi.org/10.1017/CBO9780511621024
Blum-Kulka S (1986) Shifts of cohesion and coherence in translation. In: House J, Blum-Kulka S (eds). Interlingual and intercultural communication: discourse and cognition in translation and second language acquisition studies. Gunter Narr Verlag. pp. 17–35
Bublitz W (2011) Cohesion and coherence. In: Zienkowski J, Östman J-O, Verschueren J (eds) Discursive pragmatics. John Benjamins Publishing Company. pp. 37–49. https://doi.org/10.1075/hoph.8.03bub
Cappelle B, Loock R (2013) Is there interference of usage constraints?: a frequency study of existential there is and its French equivalent il y a in translated vs. Non-translated texts. Target Int J Transl Stud 25(2):252–275. https://doi.org/10.1075/target.25.2.05cap
Cappelle B, Loock R (2017) Typological differences shining through: the case of phrasal verbs in translated English. In: De Sutter G, Lefer M-A, Delaere I (eds) Empirical translation studies. New theoretical and methodological traditions. Mouton de Gruyter. pp. 235–264. https://doi.org/10.1515/9783110459586-009
Carrell PL (1982) Cohesion is not coherence. TESOL Q 16(4):479–488. https://doi.org/10.2307/3586466
Chesterman A (2011) Translation universals. In: Gambier Y, van Doorslaer L (eds) Handbook of translation studies: volume 2. John Benjamins Publishing Company. pp. 175–179. https://doi.org/10.1075/hts.2.tra12
De Clercq O, De Sutter G, Loock R, Cappelle B, Plevoets K (2021) Uncovering machine translationese using corpus analysis techniques to distinguish between original and machine-translated French. Transl Q 101:21–45
De Sutter G, Delaere I, Plevoets K (2012) Lexical lectometry in corpus-based translation studies. In: Oakes MP, Ji M (eds) Quantitative methods in corpus-based translation studies. John Benjamins. pp. 325–346. https://doi.org/10.1075/scl.51.13sut
Delaere I, De Sutter G (2013) Applying a multidimensional, register-sensitive approach to visualize normalization in translated and non-translated Dutch. Belgian J Linguist 27(1):43–60. https://doi.org/10.1075/bjl.27.03del
Delaere I, De Sutter G, Plevoets K (2012) Is translated language more standardized than non-translated language?: Using profile-based correspondence analysis for measuring linguistic distances between language varieties. Target Int J Transl Stud 24(2):203–224. https://doi.org/10.1075/target.24.2.01del
Delaere I, De Sutter G (2017) Variability of English loanword use in Belgian Dutch translations. Measuring the effect of source language and register. In: De Sutter G, Lefer M-A, Delaere I (eds). Empirical translation studies: new methodological and theoretical traditions. De Gruyter Mouton. pp. 81–112. https://doi.org/10.1515/9783110459586-004
Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. In: Burstein J, Doran C, Solorio T (eds). Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics. pp. 4171–4186. https://doi.org/10.18653/v1/N19-1423
De Sutter G, Lefer M-A (2020) On the need for a new research agenda for corpus-based translation studies: a multi-methodological, multifactorial and interdisciplinary approach. Perspectives 28(1):1–23. https://doi.org/10.1080/0907676X.2019.1611891
van Dijk TA, Kintsch W (1983) Strategies of discourse comprehension. Academic Press
Diwersy S, Evert S, Neumann S (2014) A weakly supervised multivariate approach to the study of language variation. In: Szmrecsanyi B, Wälchli B (eds) Aggregating dialectology, typology, and register analysis: linguistic variation in text and speech. De Gruyter. pp. 174–204. https://doi.org/10.1515/9783110317558.174
Englund Dimitrova B (2005) Expertise and explicitation in the translation process. Benjamins. https://doi.org/10.1075/btl.64
Evert S, Neumann S (2017) The impact of translation direction on characteristics of translated texts. A multivariate analysis for English and German. In: De Sutter G, Lefer M-A, Delaere I (eds) Empirical translation studies: new methodological and theoretical traditions. De Gruyter Mouton. pp. 47–80. https://doi.org/10.1515/9783110459586-003
Givón T (1995) Functionalism and grammar. John Benjamins Publishing Company. https://benjamins.com/catalog/z.74
Graesser AC, McNamara DS (2011) Computational analyses of multilevel discourse comprehension. Top Cogn Sci 3(2):371–398. https://doi.org/10.1111/j.1756-8765.2010.01081.x
Graesser AC, McNamara DS, Louwerse MM, Cai Z (2004) Coh-Metrix: analysis of text on cohesion and language. Behav Res Methods, Instrum, Computers 36(2):193–202. https://doi.org/10.3758/BF03195564
Halliday MAK, Hasan R (1976) Cohesion in English (1st edn). Longman. https://doi.org/10.4324/9781315836010
Halverson SL (2015) Cognitive translation studies and the merging of empirical paradigms: the case of ‘literal translation’. Transl Spaces 4(2):310–340. https://doi.org/10.1075/ts.4.2.07hal
Hastie T, Tibshirani R, Buja A (1994) Flexible discriminant analysis by optimal scoring. J Am Stat Assoc 89(428):1255–1270. https://doi.org/10.2307/2290989
Hu B (2020) How are translation norms negotiated?: a case study of risk management in Chinese institutional translation. Target Int J Transl Stud 32(1):83–121. https://doi.org/10.1075/target.19050.hu
Hu X, Xiao R, Hardie A (2019) How do English translations differ from non-translated English writings? A multi-feature statistical model for linguistic variation analysis. Corpus Linguist Linguist Theory 15(2):347–382. https://doi.org/10.1515/cllt-2014-0047
Hundt M, Sand A, Siemund R (1999) Manual of information to accompany The Freiburg—LOB Corpus of British English (‘FLOB’). Department of English. Albert-Ludwigs-Universität Freiburg
Jiang Y, Niu J (2022) A corpus-based search for machine translationese in terms of discourse coherence. Across Lang Cult 23(2):148–166. https://doi.org/10.1556/084.2022.00182
Jiménez-Crespo MA (2015) Testing explicitation in translation: triangulating corpus and experimental studies. Across Lang Cult 16(2):257–283. https://doi.org/10.1556/084.2015.16.2.6
Johnson-Laird PN (1989) Mental models. In MI Posner (ed.). Foundations of cognitive science. The MIT Press. pp. 469–499. https://doi.org/10.7551/mitpress/3072.003.0014
Kajzer-Wietrzny M (2015) Simplification in interpreting and translation. Across Lang Cult 16(2):233–255. https://doi.org/10.1556/084.2015.16.2.5
Klaudy K, Károly K (2005) Implicitation in translation: empirical evidence for operational asymmetry in translation. Across Lang Cult 6(1):13–28. https://doi.org/10.1556/Acr.6.2005.1.2
Klaudy K (1998) Explicitation. In: Baker M (ed.). Routledge encyclopedia of translation studies. Routledge. pp. 80–84. https://doi.org/10.4324/9780203359792
Konšalová P (2007) Explicitation as a universal in syntactic de/condensation. Across Lang Cult 8(1):17–32. https://doi.org/10.1556/Acr.8.2007.1.2
Kruger H (2019) That again: a multivariate analysis of the factors conditioning syntactic explicitness in translated English. Across Lang Cult 20(1):1–33. https://doi.org/10.1556/084.001
Kruger H, De Sutter G (2018) Alternations in contact and non-contact varieties: reconceptualising that -omission in translated and non-translated English using the MuPDAR approach. Transl Cogn Behav 1(2):251–290. https://doi.org/10.1075/tcb.00011.kru
Kruger H, van Rooy B (2018) Register variation in written contact varieties of English: a multidimensional analysis. Engl World-Wide A J Varieties Engl 39(2):214–242. https://doi.org/10.1075/eww.00011.kru
Krüger R (2020) Explicitation in neural machine translation. Across Lang Cult 21(2):195–216. https://doi.org/10.1556/084.2020.00012
Kujamäki P (2004) What happens to “unique items” in learners’ translations? “Theories” and “concepts” as a challenge for novices’ views on “good translation”. In: Mauranen A, Kujamäki P (eds). Translation universals: do they exist? John Benjamins Publishing Company. pp. 187–204. https://doi.org/10.1075/btl.48.16kuj
Landauer TK, Dumais ST (1997) A solution to Plato’s problem: the latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychol Rev 104(2):211–240. https://doi.org/10.1037/0033-295X.104.2.211
Lapshinova-Koltunski E (2015) Variation in translation: evidence from corpora. In: Fantinuoli C, Zanettin F (eds). New directions in corpus-based translation studies. Language Science Press. pp. 93–114. https://doi.org/10.17169/langsci.b76.64
Lapshinova-Koltunski E (2022) Detecting normalisation and shining-through in novice and professional translations. In: Granger S, Marie-Aude Lefer (eds). Extending the scope of corpus-based translation studies. Bloomsbury. pp. 182–206. https://doi.org/10.5040/9781350143289.0015
Laviosa S (1998) Core patterns of lexical use in a comparable corpus of English narrative prose. Meta 43(4):557–570. https://doi.org/10.7202/003425ar
Li D, Yang X (2017) A corpus-based study of more and more and its environments: insights from lexical priming theory. J Beijing Int Stud Univ 39(1):116. https://doi.org/10.12002/j.bisu.084
Liu K, Afzaal M (2021) Syntactic complexity in translated and non-translated texts: a corpus-based study of simplification. PLOS ONE 16(6):e0253454. https://doi.org/10.1371/journal.pone.0253454
Liu Y, Cheung AKF, Liu K (2023) Syntactic complexity of interpreted, L2 and L1 speech: a constrained language perspective. Lingua 286:103509. https://doi.org/10.1016/j.lingua.2023.103509
Luo J, Li D (2022) Universals in machine translation?: A corpus-based study of Chinese-English translations by WeChat Translate. Int J Corpus Linguist 27(1):31–58. https://doi.org/10.1075/ijcl.19127.luo
Mallet Y, Coomans D, De Vel O (1996) Recent developments in discriminant analysis on high dimensional spectral data. Chemometrics Intell Lab Syst 35(2):157–173. https://doi.org/10.1016/S0169-7439(96)00050-0
Marco J (2012) An analysis of explicitation in the COVALT corpus: the case of the substituting pronoun one(s) and its translation into Catalan. Across Lang Cult 13(2):229–246. https://doi.org/10.1556/Acr.13.2012.2.6
Marco J (2018) Connectives as indicators of explicitation in literary translation: a study based on a comparable and parallel corpus. Target Int J Transl Stud 30(1):87–111. https://doi.org/10.1075/target.16042.mar
McEnery A, Xiao Z (2004) The Lancaster Corpus of Mandarin Chinese: a corpus for monolingual and contrastive language study. In: Lino MT, Xavier MF, Ferreira F, Costa R, Silva R (eds). Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04). European Language Resources Association (ELRA). pp. 1175–1178. http://www.lrec-conf.org/proceedings/lrec2004/pdf/231.pdf
McNamara DS, Kintsch W (1996) Learning from texts: effects of prior knowledge and text coherence. Discourse Process 22(3):247–288. https://doi.org/10.1080/01638539609544975
McNamara DS, Graesser AC, McCarthy PM, Cai Z (2014) Automated evaluation of text and discourse with Coh-Metrix (1st edn.). Cambridge University Press. https://doi.org/10.1017/CBO9780511894664
Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv:1301.3781. https://doi.org/10.48550/arXiv.1301.3781
Milošević J, Risku H (2021) Situated cognition and the ethnographic study of translation processes: translation scholars as outsiders, consultants and passionate participants. Linguist Antverpiensia New Ser Themes Transl Stud 19. https://doi.org/10.52034/lanstts.v19i0.545
Mouratidis D, Stasimioti M, Sosoni V, Kermanidis KL (2021) NoDeeLe: a novel deep learning schema for evaluating neural machine translation systems. In: Mitkov R, Sosoni V, Giguère JC, Murgolo E, Deysel E (eds) Proceedings of the Translation and Interpreting Technology Online Conference. INCOMA Ltd. pp. 37–47. https://aclanthology.org/2021.triton-1.5
Muñoz Martín R (2016) Processes of what models?: On the cognitive indivisibility of translation acts and events. Transl Spaces 5(1):145–161. https://doi.org/10.1075/ts.5.1.08mun
Murtisari ET (2016) Explicitation in translation studies: the journey of an elusive concept. Transl Interpreting 8(2):64–81. https://doi.org/10.12807/ti.108202.2016.a05
Neumann S (2014) Contrastive register variation: a quantitative approach to the comparison of English and German. De Gruyter Mouton. https://doi.org/10.1515/9783110238594
Niu J, Jiang Y (2024) Does simplification hold true for machine translations? A corpus-based analysis of lexical diversity in text varieties across genres. Humanit Soc Sci Commun 11(1):1–10. https://doi.org/10.1057/s41599-024-02986-7
Olohan M, Baker M (2000) Reporting that in translated English. Evidence for subconscious processes of explicitation? Across Lang Cult 1(2):141–158. https://doi.org/10.1556/Acr.1.2000.2.1
Øverås L (1998) In Search of the third code: an investigation of norms in literary translation. Meta 43(4):557–570. https://doi.org/10.7202/003775ar
Pym A (2005) Text and risk in translation. In: Karin A, Cecelia A (eds) New tendencies in translation studies. Göteborg University. pp. 27–42
Pym A (2015) Translating as risk management. J Pragmat 85:67–80. https://doi.org/10.1016/j.pragma.2015.06.010
Pym A (2020) Translation, risk management and cognition. In: Alves F, Jakobsen AL (eds). The Routledge handbook of translation and cognition. Routledge. pp. 445–458. https://doi.org/10.4324/9781315178127
R Core Team (2023) R: A language and environment for statistical computing [Computer software]. R Foundation for Statistical Computing. https://www.R-project.org/
Rabadán R, Labrador B, Ramón N (2009) Corpus-based contrastive analysis and translation universals: a tool for translation quality assessment English→Spanish. Babel 55(4):303–328. https://doi.org/10.1075/babel.55.4.01rab
Redelinghuys K, Kruger H (2015) Using the features of translated language to investigate translation expertise: a corpus-based study. Int J Corpus Linguist 20(3):293–325. https://doi.org/10.1075/ijcl.20.3.02red
Risku H, Windhager F (2013) Extended translation: a sociocognitive research agenda. Target Int J Transl Stud 25(1):33–45. https://doi.org/10.1075/target.25.1.04ris
Robinson D (2020) Reframing translational norm theory through 4EA cognition. Transl Cogn Behav 3(1):122–142. https://doi.org/10.1075/tcb.00037.rob
Rowlands M (2010) The new science of the mind: from extended mind to embodied phenomenology. Bradford Books. https://doi.org/10.7551/mitpress/9780262014557.001.0001
Sela-Sheffy R (2005) How to be a (recognized) translator: Rethinking habitus, norms, and the field of translation. Target Int J Transl Stud 17(1):1–26. https://doi.org/10.1075/target.17.1.02sel
Shuttleworth M (1997) Dictionary of translation studies. Routledge. https://doi.org/10.4324/9781315760490
Sinclair J (1991) Corpus, concordance, collocation. Oxford University Press
Song H (2022) A corpus-based comparative study of explicitation by investigating connectives in two Chinese translations of The Lord of the Rings. Babel Rev Int de La Trad/Int J Transl 68(1):139–164. https://doi.org/10.1075/babel.00253.son
Teich E (2003) Cross-Linguistic variation in system and text: a methodology for the investigation of translations and comparable texts. Mouton de Gruyter. https://doi.org/10.1515/9783110896541
Tirkkonen-Condit S (2004) Unique items—Over- or under-represented in translated language? In: Mauranen A, Kujamäki P (eds) Translation universals: do they exist? John Benjamins Publishing Company. pp. 177–184. https://doi.org/10.1075/btl.48.14tir
Vanmassenhove E, Shterionov D, Gwilliam M (2021) Machine translationese: effects of algorithmic bias on linguistic complexity in machine translation. In: Merlo P, Tiedemann J, Tsarfaty R (eds) Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. Association for Computational Linguistics. pp. 2203–2213. https://doi.org/10.18653/v1/2021.eacl-main.188
Vanmassenhove E, Shterionov D, Way A (2019) Lost in translation: loss and decay of linguistic richness in machine translation. In: Forcada M, Way A, Haddow B, Sennrich R (eds) Proceedings of Machine Translation Summit XVII: Research Track. European Association for Machine Translation. pp. 222–232. https://aclanthology.org/W19-6622
Vinay J-P, Darbelnet J (1958) Stylistique comparée du français et de l’anglais. Didier
Xiao R (2010) How different is translated Chinese from native Chinese?: a corpus-based study of translation universals. Int J Corpus Linguist 15(1):5–35. https://doi.org/10.1075/ijcl.15.1.01xia
Xiao R (2015) Contrastive corpus linguistics: cross-linguistic contrast of English and Chinese. In: Zou B, Smith S, Hoey M (eds) Corpus linguistics in Chinese contexts. Palgrave Macmillan UK. pp. 35–62. https://doi.org/10.1057/9781137440037_3
Xiao R, Hu X (2015) Corpus-based studies of translational Chinese in English-Chinese Translation. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-41363-6
Zhang X, Kotze (Kruger) H, Fang J (2020) Explicitation in children’s literature translated from English to Chinese: a corpus-based study of personal pronouns. Perspectives 28(5):717–736. https://doi.org/10.1080/0907676X.2019.1689276
Ziganshina LE, Yudina EV, Gabdrakhmanov AI, Ried J (2021) Assessing human post-editing efforts to compare the performance of three machine translation engines for English to Russian translation of Cochrane plain language health information: results of a randomised comparison. Informatics 8(1):9. https://doi.org/10.3390/informatics8010009
Zufferey S, Cartoni B (2014) A multifactorial analysis of explicitation in translation. Target Int J Transl Stud 26(3):361–384. https://doi.org/10.1075/target.26.3.02zuf
Zwaan RA, Magliano JP, Graesser AC (1995) Dimensions of situation model construction in narrative comprehension. J Exp Psychol Learn Mem Cogn 21(2):386–397. https://doi.org/10.1037/0278-7393.21.2.386
Acknowledgements
This work was supported by the Chongqing Municipal-Level Research and Innovation Project for Doctoral Students at Southwest University [Grant No. CYB240092]. Special thanks to Professor Dechao Li (The Hong Kong Polytechnic University) for providing access to the COCE corpus.
Author information
Contributions
JL contributed to conceptualisation, methodology, formal analysis, visualisation, funding acquisition and writing—original draft; YG contributed to validation and writing—revising and editing.
Ethics declarations
Competing interests
The authors declare no competing interests.
Ethical approval
This article does not contain any studies with human participants performed by any of the authors.
Informed consent
This article does not contain any studies with human participants performed by any of the authors. Therefore, informed consent is not applicable to the present research.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Li, J., Gao, Y. Variability of cohesion and coherence in Chinese-to-English translation: measuring the effect of translation variety and register divergence. Humanit Soc Sci Commun 12, 1526 (2025). https://doi.org/10.1057/s41599-025-05814-8