Introduction

Building on research initially rooted in organizational psychology, which has demonstrated quantifiable associations between levels of psychological safety and team performance under pressure, as well as wellbeing and job satisfaction in individual team members1,2,3,4, a growing body of literature discusses the application of the concept in sports settings5,6. An increasing number of studies suggest that psychological safety is associated with adaptive outcomes in sports, for example, higher quality coach-athlete relationships, resilience, and mental health7,8,9. Most studies investigating psychological safety have utilized Edmondson’s Team Psychological Safety Scale (TPSS)3,10, which was developed with a focus on performance development in professional teams. According to the operational definition of the TPSS, psychological safety is ‘a shared belief that the team is safe for interpersonal risk taking’[10, p. 354]. This definition characterizes psychological safety as mutual respect and trust among team members, where speaking up or being oneself does not lead to negative consequences. The emphasis is on open communication without fear of embarrassment or punishment, which has also been linked to factors such as leadership and organizational policies10,11,12,13.

In 2021, the International Olympic Committee (IOC) defined psychological safety in sports as ‘the creation of an athletic environment where athletes feel comfortable being themselves, can take necessary interpersonal risks, have the knowledge and understanding of mental health symptoms and disorders, and feel supported and comfortable in seeking help if needed’[14, p. 34]. In addition to interpersonal coherence and open communication, this definition encompasses the individual’s understanding of mental health and readiness to seek help. Given the contextual differences between organizational (e.g., business, healthcare) and sports settings, as well as the semantic gap in the interpretation of psychological safety across contexts6, it is unsurprising that a systematic review of the literature on psychological safety in sports reported that only 30% (n = 67) of articles investigating the concept provided a clear definition5. In sports, the term psychological safety was often used as a broad label to describe phenomena ranging from threat and harm to general impressions of inclusivity, equality, and respect. Based on their review, Vella and colleagues proposed defining psychological safety in sports as ‘the perception that one is protected from, or unlikely to be at risk of, psychological harm in sport’[5, p. 15].

One of the few instruments adapted for the measurement of psychological safety in sports is the Sport Psychological Safety Inventory (SPSI), which includes three subscales: mentally healthy environment, mental health literacy, and low self-stigma15. The initial validation study, conducted among Australian elite athletes and coaches, supported a three-factor correlated structure. Low scores on the mentally healthy environment subscale and high scores on the low self-stigma subscale were associated with caseness for moderate mental health distress, whereas scores on the mental health literacy subscale were not predictive of such distress15.

Although both the TPSS and SPSI have been developed to measure the concept of psychological safety and both scales have been applied in sports7,8,9,15,16, they diverge in their operational definitions. It remains unclear how these scales conceptually relate to each other and to important endpoints (e.g., health, coach-athlete relationship, performance) in sports, which has implications for the interpretation and conclusions of studies. In addition, the data collection for the SPSI validation was performed during the early stages of the COVID-19 pandemic, a period characterized by exposure to strong yet transient psychological stressors. The authors therefore called for further validation of the psychometric properties of the inventory, as well as replication studies in diverse samples and cross-cultural settings15. In response to this call, the aim of this study was to investigate the psychometric properties of the TPSS and the SPSI, including internal consistency, factorial validity, construct validity and measurement invariance, in a Swedish elite sport context.

Methods

Participants

Swedish Athletics (track and field) athletes and orienteers, ranging from junior national sub-elite to senior international elite categories and aged ≥ 15 years, were invited to participate. A total of 371 athletes (Athletics: n = 233, females = 125; orienteering: n = 138, females = 73) completed the questionnaire. The mean age was 18.72 years (SD = 4.73) in the Athletics sample and 18.93 years (SD = 3.90) in the orienteering sample. Table 1 presents descriptive statistics for participants’ mean age, the age at which they began training in their sport, training hours per week, and the number of coaches they were currently trained by, categorized by competitive level. Together, the participants represented a broad range of competitive levels within these sports.

Table 1 Descriptive statistics of participants’ competitive levels, age, age when they started training in the sport, training hours/week and number of coaches.

Study design and data collection

This study employed a cross-sectional design using an online survey. The data utilized in this study are part of a larger data collection, which included standardized questionnaires related to mental health, psychological safety, and other environmental or health prerequisites in elite Athletics and orienteering. With support from the Swedish Athletics Federation and the Swedish Orienteering Federation, an invitation containing a QR code and a weblink to the survey was distributed to all National Sports High Schools, elite clubs, high-performance environments, and national teams within the respective federations. Data were collected from April 2023 to March 2024, and the survey was completed anonymously. Data collection for Athletics was conducted using the Lynes platform (lynes.io), while data collection for orienteering was conducted using the Artologik Survey&Report platform (artologik.com). The same survey was administered on both platforms. The change of platform was driven by technical considerations and was not considered to affect the quality of data collection.

Measures

Demographics collected included age, the age at which participants had started training in the sport, self-assigned gender, number of training hours/week, and the number of coaches they were currently trained by.

The Team Psychological Safety Scale (TPSS)10 consists of seven items and was translated from English to Swedish using a back-translation procedure. Originally developed for use in organizations, the scale assesses team psychological safety, that is, the extent to which team members feel safe taking interpersonal risks such as admitting mistakes or asking for help. Respondents rate each item on a 7-point scale, ranging from 1 (“strongly disagree”) to 7 (“strongly agree”). Three items (items 1, 3 and 5) are reverse scored. Total scores range from 7 to 49, with higher scores indicating greater perceived psychological safety. While support for the reliability and validity of the TPSS has been reported in both non-sports and sports contexts9,10,16, one study found potential problems related to item 6 when the scale was used in sports7.

The eleven-item Sport Psychological Safety Inventory (SPSI)15 was translated from English to Swedish using a back-translation procedure. The SPSI operationalizes psychological safety into three subscales: mentally healthy environment (four items), mental health literacy (four items), and low self-stigma (three items). Respondents rate their answers on a five-point scale ranging from 0 (“strongly disagree”) to 4 (“strongly agree”). Three items (items 9, 10 and 11) are reverse scored. Higher scores on the subscales indicate a higher level of perceived mentally healthy environment (total score range: 0–16), higher mental health literacy (total score range: 0–16), and lower self-stigma (total score range: 0–12). The initial validation provided support for the scale’s internal consistency and a three-factor correlated structure15.
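For illustration, a minimal Python scoring sketch for the two scales is given below. The item column names (tpss_1–tpss_7, spsi_1–spsi_11) and the assumption that the SPSI items are ordered by subscale (items 1–4 mentally healthy environment, items 5–8 mental health literacy, items 9–11 low self-stigma) are hypothetical conventions for this sketch, not specifications taken from the original questionnaires.

```python
import pandas as pd

def score_tpss(df: pd.DataFrame) -> pd.Series:
    """Total TPSS score (range 7-49); items rated 1-7, items 1, 3 and 5 reverse scored."""
    items = df[[f"tpss_{i}" for i in range(1, 8)]].copy()
    for i in (1, 3, 5):
        items[f"tpss_{i}"] = 8 - items[f"tpss_{i}"]      # reverse on a 1-7 scale
    return items.sum(axis=1)

def score_spsi(df: pd.DataFrame) -> pd.DataFrame:
    """SPSI subscale totals; items rated 0-4, items 9-11 reverse scored.
    Subscale membership below is an assumption about item order."""
    items = df[[f"spsi_{i}" for i in range(1, 12)]].copy()
    for i in (9, 10, 11):
        items[f"spsi_{i}"] = 4 - items[f"spsi_{i}"]      # reverse on a 0-4 scale
    return pd.DataFrame({
        "mentally_healthy_environment": items[[f"spsi_{i}" for i in (1, 2, 3, 4)]].sum(axis=1),
        "mental_health_literacy": items[[f"spsi_{i}" for i in (5, 6, 7, 8)]].sum(axis=1),
        "low_self_stigma": items[[f"spsi_{i}" for i in (9, 10, 11)]].sum(axis=1),
    })
```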

A Swedish version of the fourteen-item Hospital Anxiety and Depression Scale (HADS) was utilized to assess anxiety (seven items) and depression (seven items)17,18. Responses are scored on a four-point scale (ranging from 0 to 3), with total scores for each subscale ranging from 0 to 21. Higher scores indicate greater levels of anxiety and depression symptoms, with a cut-off score of ≥ 11 recommended to identify probable cases of clinically significant anxiety or depression disorders19. The HADS is widely used in Swedish healthcare and has been extensively validated, demonstrating good psychometric properties17,18,19,20,21.

A Swedish version of the 11-item Coach-Athlete Relationship Questionnaire (CART-Q)22,23 was used to assess the coach-athlete relationship in terms of commitment, closeness, and complementarity. Respondents rated their responses on a seven-point scale ranging from 1 (“strongly disagree”) to 7 (“strongly agree”). The scale has been validated in various languages, demonstrating adequate psychometric properties22,23. In the present study, linguistic problems with the Swedish wording of item 2 (“I feel committed to my coach”) were identified during data screening, resulting in an unacceptably low McDonald’s omega. Problems with this item in the Swedish version of the CART-Q have also been identified previously23,24. Consequently, this item was removed in this study, while the remaining ten items were retained. The total scores for the 10-item version of the CART-Q in this study ranged from 10 to 70. A high score indicates a good quality coach-athlete relationship.

Statistical analyses

The sample characteristics, including means and standard deviations (SD), were analyzed using descriptive statistics, and scale reliability was calculated using McDonald’s omega (ω). Mann–Whitney U tests were conducted to explore differences between sports (Athletics athletes and orienteers) as well as between female and male athletes. The effect size for the Mann–Whitney U test (r) was calculated, with r < 0.3 representing a small effect and 0.3 and 0.5 the thresholds for medium and large effects, respectively25. To evaluate the construct validity of the TPSS and SPSI, Spearman rank-order correlations were calculated with scores from instruments measuring the coach-athlete relationship (CART-Q) and mental health (HADS for anxiety and depression). Given that psychological safety as a construct has been suggested to be associated with higher quality coach-athlete relationships and favorable conditions to support athletes’ mental health7,8,9, we hypothesized that psychological safety scores on both scales would be positively related to CART-Q scores and negatively related to HADS scores. Both the Mann–Whitney U test and the Spearman rank-order correlation are non-parametric tests appropriate for the ordinal data used in this study. Neither test assumes normally distributed data, as both are based on ranks of scores. However, the Mann–Whitney U test assumes similar distribution shapes across independent groups, while the Spearman rank-order correlation assumes independent observations between pairs of variables25. Descriptive analyses and non-parametric tests were performed using SPSS Statistical Package version 29.
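As an illustration of the non-parametric analyses described above, the sketch below computes a Mann–Whitney U test with the effect size r = |Z|/√N and a Spearman rank-order correlation using scipy. The analyses in this study were performed in SPSS; the code is only an approximate re-expression, the variable names are placeholders, and the Z statistic is derived from the normal approximation of U without a tie correction.

```python
import numpy as np
from scipy import stats

def mann_whitney_with_r(x, y):
    """Mann-Whitney U test plus effect size r = |Z| / sqrt(N).

    Z is obtained from the normal approximation of U (no tie correction),
    so r is approximate when many tied ranks are present.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    u, p = stats.mannwhitneyu(x, y, alternative="two-sided")
    n1, n2 = len(x), len(y)
    mu_u = n1 * n2 / 2
    sigma_u = np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu_u) / sigma_u
    r = abs(z) / np.sqrt(n1 + n2)   # < 0.3 small, >= 0.3 medium, >= 0.5 large
    return u, p, r

# Construct validity: Spearman rank-order correlation between scale totals,
# e.g. TPSS total vs. CART-Q total (variable names are placeholders).
# rho, p = stats.spearmanr(tpss_total, cartq_total, nan_policy="omit")
```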

Confirmatory factor analyses (CFA) were conducted using MPlus version 8.8 to validate the factor structure of the measurement models for the TPSS and SPSI. Before conducting the CFAs, tolerance and the variance inflation factor (VIF) were investigated to diagnose collinearity. Multicollinearity is indicated by a VIF above 4 or a tolerance below 0.25; no indication of collinearity was found in the data. Additionally, Mahalanobis distance was explored to detect multivariate outliers. The Mahalanobis distance measures the distance of a case from the centroid of the other cases, with the centroid being the point where the means of all variables intersect. A case is considered a multivariate outlier if its distance exceeds the chi-square (χ²) critical value, with degrees of freedom equal to the number of variables, at a significance level of p < .00126. To prevent multivariate outliers from disproportionately influencing the results and distorting the overall model fit, which could lead to misleading conclusions about the model’s adequacy, multivariate outliers were removed prior to conducting the CFAs. Missing data were handled using pairwise deletion.
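The collinearity and outlier screening can be illustrated with the following Python sketch (numpy, scipy, statsmodels). The study used SPSS and MPlus, so this is an approximation of the procedure described above rather than the authors’ implementation, and the data frame of item responses is assumed to contain numeric item scores.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy.stats import chi2
from statsmodels.stats.outliers_influence import variance_inflation_factor

def flag_multivariate_outliers(items: pd.DataFrame, alpha: float = 0.001) -> pd.Series:
    """Flag cases whose squared Mahalanobis distance exceeds the chi-square
    critical value with df = number of items at the chosen alpha (.001 here)."""
    complete = items.dropna()
    X = complete.to_numpy(float)
    diff = X - X.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)   # squared Mahalanobis distances
    cutoff = chi2.ppf(1 - alpha, df=X.shape[1])          # e.g. chi2(7) at .999 is about 24.32
    return pd.Series(d2 > cutoff, index=complete.index)

def vif_table(items: pd.DataFrame) -> pd.DataFrame:
    """VIF and tolerance (= 1/VIF) per item; VIF > 4 or tolerance < 0.25
    would indicate problematic collinearity."""
    X = sm.add_constant(items.dropna()).to_numpy(float)
    vif = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]  # skip the constant
    return pd.DataFrame({"item": list(items.columns), "VIF": vif,
                         "tolerance": [1 / v for v in vif]})
```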

The four a priori hypothesized measurement models tested are displayed in Fig. 1. For the TPSS, and based on Edmondson’s original scale10, a one-factor a priori measurement model was tested (Fig. 1a).

Fig. 1

A priori hypothesized measurement models tested for the TPSS (a) and the SPSI (b–d).

For the SPSI, three a priori hypothesized measurement models were tested, based on findings in the initial validation study conducted by Rice et al.15:

  1. A first-order measurement model with one latent factor (Fig. 1b).

  2. A first-order measurement model with three correlated latent factors (mentally healthy environment, mental health literacy, low self-stigma) (Fig. 1c).

  3. A higher-order measurement model with one higher-order factor (psychological safety) and three latent factors (mentally healthy environment, mental health literacy, low self-stigma) (Fig. 1d).

To examine the interrelationship between the latent factors in the TPSS and the SPSI, a post hoc analysis was performed to analyze the scales together. The measurement models that displayed the most acceptable model fit for each scale were combined into a comprehensive model, with the latent factors from the two scales specified as correlated (see Fig. 2).

To assess the model fit of the hypothesized models, and because ordinal data were used, mean- and variance-adjusted weighted least squares (WLSMV) estimation was adopted to provide robust parameter estimates and standard errors27. Model fit was evaluated using the comparative fit index (CFI) and the root mean square error of approximation (RMSEA)28. A good model fit is indicated by CFI > 0.95 and RMSEA < 0.06. For RMSEA, values between 0.08 and 0.10 indicate a mediocre fit, while values > 0.10 indicate a poor-fitting model28,29,30.
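To make the hypothesized model specifications concrete, the sketch below writes the one-factor TPSS model (Fig. 1a) and the three-factor correlated SPSI model (Fig. 1c) in lavaan-style syntax and fits them with the open-source semopy package. The authors used MPlus with WLSMV; semopy’s diagonally weighted least squares (DWLS) objective is only a rough analogue, and the item names, the availability of DWLS in a given semopy version, and the exact contents of the calc_stats output are assumptions. The combined post hoc model (Fig. 2) could be specified analogously by concatenating the two model strings and, if needed, adding explicit covariance terms with the `~~` operator.

```python
import pandas as pd
import semopy

# One-factor TPSS model (Fig. 1a); item names are placeholders.
TPSS_ONE_FACTOR = """
psychological_safety =~ tpss_1 + tpss_2 + tpss_3 + tpss_4 + tpss_5 + tpss_6 + tpss_7
"""

# Three-factor correlated SPSI model (Fig. 1c); subscale membership assumed from item order.
SPSI_THREE_FACTOR = """
mentally_healthy_environment =~ spsi_1 + spsi_2 + spsi_3 + spsi_4
mental_health_literacy =~ spsi_5 + spsi_6 + spsi_7 + spsi_8
low_self_stigma =~ spsi_9 + spsi_10 + spsi_11
"""

def fit_cfa(model_desc: str, data: pd.DataFrame) -> pd.DataFrame:
    """Fit a CFA with a diagonally weighted least squares objective (a rough
    analogue of MPlus's WLSMV) and return fit statistics (chi2, df, CFI, RMSEA, ...)."""
    model = semopy.Model(model_desc)
    model.fit(data, obj="DWLS")   # fall back to the default objective if DWLS is unavailable
    return semopy.calc_stats(model)

# Interpretation used in the paper: CFI > 0.95 and RMSEA < 0.06 indicate good fit;
# RMSEA between 0.08 and 0.10 a mediocre fit, and RMSEA > 0.10 a poor fit.
```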

Measurement invariance was tested to evaluate the equivalence of the scales (TPSS and SPSI) across gender. Measurement invariance evaluates whether a construct is interpreted and assessed similarly across groups and is a prerequisite for group mean comparisons31,32. This involves analyses of increasingly constrained, nested models. First, configural invariance is established by analyzing the model fit achieved with only the factorial structure constrained across the groups of females and males. This step assesses the invariance of the dimensional model’s configuration across both groups and also serves as the baseline for further steps in the measurement invariance tests. In the second step, metric invariance is tested by constraining the factor loadings across gender. The third step focuses on scalar invariance, which requires the item thresholds to be identical for both genders. We adopted the MPlus shortcut option that automatically runs multiple group models to test measurement invariance, using the settings configural, metric and scalar33. To test whether successively more constrained models differ significantly (p <.05), and are thereby non-invariant, the shortcut option provides chi-square difference testing with scalar corrections for WLSMV33,34,35,36. The chi-square difference test is an exact fit approach, but a limitation is that the test can be overly sensitive, particularly when using large samples37. Indices of approximate fit have been discussed as a solution, usually by calculating differences in CFI (∆CFI) or RMSEA (∆RMSEA)32,37,38. However, these indices are descriptive, and there is no clear consensus on which fit indices and cut-offs should be used to assess misspecification under various conditions32,37,38. For example, simulation analyses show that ∆CFI may retain both well-fitting and poor-fitting models, creating uncertainty regarding the appropriate cut-off32. The ∆RMSEA has been reported to lack sensitivity and could therefore potentially mask misfit, particularly for models with large initial degrees of freedom37. While additional indices have been proposed (e.g., RMSEAD), they have also met objections37,39. A discrepancy between modification index values and chi-square difference tests in MPlus can also be observed when using WLSMV for ordinal data, due to the adjustments made in the chi-square difference test to accommodate this type of estimation34. Given the controversies surrounding the interpretation of various indices of approximate fit in measurement invariance testing, and considering our use of WLSMV to account for ordinal data, we decided to evaluate measurement invariance using the chi-square difference test provided in the MPlus shortcut option. We judged this method to be more reliable than the use of approximate fit indices, particularly because our sample size was not overly large. Statistical significance in all analyses was determined by a p-value < 0.05.
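As a simplified illustration of the nested-model comparison logic, the sketch below performs a naive chi-square difference test between two nested invariance models (e.g., metric vs. configural). With WLSMV the raw difference is not chi-square distributed; MPlus applies a scaling correction in its difference testing, which this sketch deliberately omits, so it should be read as a conceptual outline only.

```python
from scipy.stats import chi2

def chi2_difference_test(chi2_constrained: float, df_constrained: int,
                         chi2_baseline: float, df_baseline: int):
    """Naive chi-square difference test between nested models.

    NOTE: with WLSMV estimation the raw chi-square difference is not
    chi-square distributed; MPlus uses a scaled difference test instead,
    which is not reproduced here.
    """
    d_chi2 = chi2_constrained - chi2_baseline
    d_df = df_constrained - df_baseline
    p = chi2.sf(d_chi2, d_df)
    return d_chi2, d_df, p

# p < .05 would indicate that the added equality constraints (loadings at the
# metric step, thresholds at the scalar step) significantly worsen model fit,
# i.e., that the corresponding level of invariance does not hold.
```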

Ethics statement

The study was approved by the Swedish Ethical Review Authority (2022–03327-01). All participants were 15 years or older, and in accordance with Swedish ethical regulations, parental consent was not required. Participants provided informed consent in response to the initial survey question.

Results

Demographics

Mean and standard deviations of all scales for the two sports (Athletics and orienteering) as well as for self-assigned gender (female and male athletes) are shown in Table 2. Gender differences related to low self-stigma (SPSI) and anxiety (HADS) were revealed, with female athletes reporting lower self-stigma and higher anxiety scores than males. No other significant differences in the assessments across sports or gender were found. Table 2 also displays skewness, kurtosis, and McDonald’s omega (ω) for the scales. All scales demonstrated acceptable ω values (> 0.70) and were, except for the CART-Q, approximately normally distributed.

Table 2 Descriptive statistics of all scales for the two sports (Athletics and orienteering) and self-assigned gender. Skewness, kurtosis and McDonald’s omega (ω) for all scales are also displayed.

Construct validity

The strongest positive correlations between the psychological safety inventories and the validation instruments were observed for the TPSS and the SPSI subscale mentally healthy environment (Table 3). Although all psychological safety scales (TPSS, SPSI subscales) were significantly and negatively correlated with anxiety and depression scores, the TPSS and the subscale mentally healthy environment (SPSI) showed the strongest negative correlations.

Table 3 Associations between psychological safety inventories and validation instruments (Spearman rank-order correlations).

Confirmatory factor analyses (CFA)

Because no significant differences were found between Athletics athletes’ and orienteers’ mean scores on the psychological safety inventories (Table 2), the study participants were analyzed as one sample in the CFA. Data screening with Mahalanobis distance identified 12 multivariate outliers (χ2(7) ≥ 24.32, p ≤.001) for the TPSS. One additional case had incomplete data, and these 13 cases were excluded from further analyses, resulting in a final sample of 358 cases (females: n = 192; males: n = 166) used in the measurement invariance analyses of the TPSS. For the SPSI, eight cases were identified as multivariate outliers and four had incomplete data. The final sample used for the SPSI included 359 cases (females: n = 192; males: n = 167).

Results from all CFAs are presented in Table 4. The CFA conducted on the TPSS with one latent factor indicated a good model fit across all fit indices, while the one-factor model of the SPSI revealed a poor model fit. Analyses of the proposed SPSI three-factor correlated model and the higher-order model, in which the three latent factors were specified to load on a higher-order factor, showed acceptable model fit (with the CFI indicating an excellent fit and the RMSEA suggesting a mediocre fit).

Table 4 Confirmatory factor analyses for a priori hypothesized models of the TPSS and SPSI with chi-square (χ²) and degrees of freedom (df). Model fit was evaluated by the comparative fit index (CFI) and the root mean square error of approximation (RMSEA) with 90% confidence interval (CI).

Figure 2 presents the post hoc analysis in which the one-factor solution of the TPSS and the three-factor measurement model of the SPSI were analyzed within the same model. Mahalanobis distance identified 13 multivariate outliers (χ2(18) ≥ 42.31, p ≤.001), and four cases had incomplete data. Analyses were performed on 354 cases (females: n = 188; males: n = 166). The one latent factor of the TPSS was specified to correlate with the three latent factors of the SPSI. As shown in Table 4, this combined model demonstrated an acceptable model fit, with all fit indices reaching acceptable levels. The strongest relationship between the latent factors of the TPSS and SPSI was found between psychological safety (TPSS) and mentally healthy environment (SPSI), while the relationships between psychological safety (TPSS) and both mental health literacy (SPSI) and low self-stigma (SPSI) were weaker.

Fig. 2

Post hoc confirmatory factor analysis with the one latent factor measurement model of the TPSS and the three-factor correlated measurement model of the SPSI analysed in one model. Standardized correlations between the TPSS and the SPSI latent variables and standardized factor loadings.

Measurement invariance across gender

When females and males were analyzed separately, the TPSS (the first order model with one latent factor) displayed an acceptable to mediocre model fit for both genders (females: χ2 = 32.36(14), p <.001, CFI = 0.98, RMSEA = 0.08; males: χ2 = 30.11(14), p <.001, CFI = 0.98, RMSEA = 0.08). The SPSI (the first order model with three correlated latent factors) displayed an acceptable to poor model fit (females: χ2 = 138.74(41), p <.001, CFI = 0.97, RMSEA = 0.11; males: χ2 = 110.23(41), p <.001, CFI = 0.97, RMSEA = 0.10).

Measurement invariance tests were performed for the TPSS and the SPSI, respectively (Table 5). The shortcut chi-square difference testing suggested the TPSS to be metric and scalar invariant across genders. For the SPSI, the shortcut chi-square difference testing suggested the model to be metric but not scalar invariant across genders. To explore the invariance of individual thresholds, they were constrained one by one. The scalar-metric comparisons showed all single thresholds tested to be significant (p <.001), indicating that they were non-invariant.

Table 5 Measurement invariance (configural, metric and scalar) of the TPSS and the SPSI across genders. Configural invariance denotes the model fit achieved with only the factorial structure constrained across the groups of females and males, also serving as the baseline. Metric invariance is tested by constraining the factor loadings across gender. Scalar invariance requires the item thresholds to be identical for both genders.

Discussion

This validation study of instruments measuring psychological safety confirmed the internal consistency of the investigated scales and their proposed factor structures: a one-factor solution for the TPSS and a three-factor correlated solution for the SPSI. Consistent with the findings of Rice et al.15, a one-factor solution for the SPSI was not supported and a higher order model was not found superior to the three-factor correlated solution. The TPSS was found to be fully invariant across genders, while scalar invariance was not supported for the SPSI. Indications of non-invariance across gender present a significant challenge for researchers aiming to conduct gender comparisons with the scale. When invariance is questionable, any observed score differences may reflect measurement bias rather than true differences in the construct, rendering such comparisons scientifically meaningless32,40. Further research is desirable to investigate the measurement invariance of the scales across genders, sports, cultures and other groups that may be of interest for comparisons. Our results, however, suggest that if researchers are faced with the choice between the scales for studying gender differences related to psychological safety in sports, the TPSS may be preferable to the SPSI.

Regarding construct validity, the TPSS correlated with the indicators of mental health and the quality of the coach-athlete relationship in the theoretically expected directions. The mentally healthy environment subscale of the SPSI exhibited a pattern similar to that of the TPSS. Overall, the moderate strength of the correlation between the TPSS and the SPSI mentally healthy environment subscale, when the two scales were jointly analyzed in the post hoc CFA, suggests that these two scales partly, but not entirely, target a similar concept. The other two subscales, mental health literacy and low self-stigma, exhibited a divergent pattern, suggesting that they measure constructs that are conceptually distinct from both the TPSS and the mentally healthy environment subscale. These findings are important, yet anticipated, given the semantic differences in the definition of psychological safety across organizational and sports contexts2,3,14,15. Psychological safety has been extensively investigated in organizational settings, with several theoretical perspectives proposed to explain its mechanisms at different levels (individual, team, or organizational) and its influence on work outcomes3. In comparison, an aim of introducing the concept in sports appears to have been the identification of predictors of future mental health, as reflected in both the definition proposed by the IOC and the SPSI developed from this perspective14,15. However, the specific purpose of applying the psychological safety concept in sports remains unclear, which is also reflected in the fact that the transfer of the organizational meaning to sports settings has been contested6. The IOC publication14 that presents the definition of psychological safety on which the SPSI builds offers limited guidance because references to empirical scientific studies are lacking. This raises the question of whether describing the SPSI as a ‘sport psychological safety scale’ is constructive. Despite an acceptable model fit and internal consistency, the SPSI seems to lack a clear, empirically supported definition or theoretical foundation to guide researchers’ interpretation of scores obtained with the scale. In other words, it is unclear what the scale truly measures.

The empirical knowledge on how psychological safety in sports is perceived and influenced by various factors, as well as its relationship to different outcomes (e.g., performance, health, long-term development, motivation), is currently limited. The diverse and vague descriptions pose a risk of constraining scientific progress and practical assessments of psychological safety in sports5,6. Experiences from outside the sports domain suggest that researchers need to study not only benefits but also potential drawbacks related to psychological safety in various settings3. It is essential to ensure that recommendations related to psychological safety in sports are founded on empirical studies of high methodological quality, including valid assessments. This implies that continued research is warranted on what the SPSI measures, for example by comparing its subscales to existing scales, such as those for mental health literacy41,42 and stigma43,44. The domain (i.e., the target concept, attribute, unobserved behavior, etc.) should be clearly articulated and defined. A well-defined, theoretically supported domain is crucial for establishing construct validity and the boundaries of the construct that the scale should assess45. Moreover, the existing literature should be reviewed to establish whether existing instruments could serve the same purpose as the intended new scale. If similar scales exist, a justification for developing a new scale is required, along with an explanation of how it differs from existing instruments45. Finally, when adopting the TPSS and SPSI in sports, researchers should be cautious of the jingle fallacy, which occurs when two different scales are assumed to assess the same construct because they share the same name but in fact assess different constructs46. Jingle fallacies can lead to confusion and misinterpretation, making it challenging to compare and integrate findings across studies. When transferring a concept from one setting to another, as applies to the TPSS and SPSI in the sports setting, researchers should also be actively aware of the risk of concept creep, which can distort the original meaning of the term through semantic shifts and subsequently undermine the scientific and practical value of the construct47.

This study offers new insights into the psychometric properties of two scales used to measure psychological safety in sports. However, some limitations should be noted when interpreting the results. Although the sample comprised elite athletes across a range of ages, from junior elite to senior elite levels, it was predominantly composed of young developing athletes. Additionally, the study included only individual-sport athletes, specifically Athletics athletes and orienteers. It is possible that psychological safety, when assessed according to its organizational meaning, is a more relevant construct for athletes participating in team sports than in individual sports. This hypothesis could not be tested in this study. The population studied was from a single Scandinavian country, and cultural and educational background may also influence the results. In addition, the study did not include any coaches, support staff or other groups involved in sports environments. Therefore, future research should include both individual and team sports, as well as coaches and staff from various countries and sporting levels, to further evaluate the psychometric properties of the scales.

In conclusion, the results of this study underscore that psychological assessments used in sports should be based on judiciously developed operational definitions and carefully validated. The TPSS exhibited acceptable psychometric properties for assessing psychological safety in an elite sports context. While the SPSI three-factor correlated model demonstrated a robust factor structure and internal consistency, it was not invariant across genders. Concerns about its construct validity were also raised. These findings underscore a need for caution when using the SPSI as a measure of psychological safety in sports settings.