Introduction

Observational studies are inherently vulnerable to various types of bias due to the absence of randomization. Therefore, effectively controlling bias in these studies is essential to enhance the validity of the results, ensure generalizability, and inform decision-making. It is crucial to follow established guidelines, such as the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) guidelines, to ensure high-quality reporting in observational studies1.

Mitigating bias presents a significant challenge for researchers, and several strategies have been proposed to address confounding variables, including the use of matching estimators2. Propensity score matching (PSM), first introduced by Rosenbaum and Rubin in 1983, is a statistical method for reducing confounding bias in observational studies3. This method has gained wide acceptance for addressing confounding. When properly implemented, it enhances covariate balance between treatment groups and approximates the conditions of a randomized controlled trial, thereby strengthening causal inference4. This strategy combines matching with the propensity score to approximate a quasi-randomized environment5. Propensity score methods can also be used not only to reduce confounding but to define or approximate specific target populations, allowing researchers to emulate the conditions of different randomized controlled trials within observational data6. This methodology has gained widespread application in kidney transplantation studies, where researchers must consider multiple patient covariates7.

This method offers a robust analytical framework that minimizes confounding factors, ultimately leading to stronger evidence-based practice8,9,10. PSM is particularly advantageous in scenarios where randomization is impractical or impossible, making it a powerful tool for observational studies in kidney transplantation11. The purpose of this review is to explain PSM, explore its practical applications in the field of kidney transplantation, and provide a practical example.

Applications of PSM in kidney transplant research

In kidney transplant studies, common confounding variables that can affect outcomes include age, sex, dialysis duration, and comorbidities12,13,14,15. The use of PSM can be particularly useful in various scenarios within this field. For example, PSM enables the comparison of outcomes from different therapies in transplanted patients, providing a robust methodology to control for confounding variables and derive more accurate conclusions16. Moreover, this technique is valuable for evaluating the impact of pretransplant conditions on posttransplant outcomes, helping to identify which preexisting factors may influence patient recovery and survival17,18. Additionally, PSM is used to identify factors associated with different outcomes in transplanted patients, facilitating an understanding of variables that can affect transplant effectiveness and posttransplant quality of life7,19,20,21. These applications underscore the versatility of PSM in kidney transplant research, allowing researchers to explore a wide range of questions related to treatment efficacy and patient outcomes11.

Methodology: implementing PSM

Covariate selection

The starting point in any PSM approach is selecting the right covariates. These should include variables that are related both to the likelihood of receiving the treatment and to the outcome, since failing to control for such confounders can bias the results22,23. To strengthen this step, it is advisable to draw on existing literature and expert knowledge, which can help ensure that all relevant factors are taken into account11.

However, not all variables should be included. It is crucial to distinguish between confounders, which must be adjusted for, and mediators, which lie along the causal pathway and may distort the estimated treatment effect if included. To navigate these decisions systematically, researchers can use tools such as Directed Acyclic Graphs (DAGs), which offer a clear visual representation of hypothesized relationships and help guide covariate selection in a defensible way24.
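For illustration, the brief R sketch below encodes a hypothetical DAG with the dagitty package and asks which covariates must be adjusted for; the variables (Age, DialysisTime, Diabetes, Transplant) and the assumed arrows are illustrative only and do not come from the dataset analyzed later.

library(dagitty)

# Hypothetical causal structure: age and dialysis time confound the
# diabetes-transplant relationship
dag <- dagitty("dag {
  Age -> Diabetes
  Age -> Transplant
  DialysisTime -> Diabetes
  DialysisTime -> Transplant
  Diabetes -> Transplant
}")

# Minimal adjustment set(s) for the effect of Diabetes on Transplant
adjustmentSets(dag, exposure = "Diabetes", outcome = "Transplant")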

Estimating propensity scores

Once the covariates are defined, the next step is to estimate each individual’s probability of receiving the treatment. This is commonly done using logistic regression, which models the likelihood of treatment assignment based on the selected covariates. While logistic regression remains the standard, alternatives such as probit models or machine learning techniques (e.g., random forests or gradient boosting) may be better suited to complex scenarios or multi-arm settings. In any case, including interaction terms or nonlinear functions often helps improve model fit and enhances balance across groups. Notably, since the goal is not prediction but covariate balancing, overfitting is not considered a major issue here23.
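As a minimal sketch, assuming hypothetical column names (treatment, age, sex, dialysis_time, comorbidity) in a data frame my_data, the propensity score can be estimated with logistic regression as follows:

# Propensity score model: probability of treatment given the selected covariates
ps_model <- glm(treatment ~ age + sex + dialysis_time + comorbidity,
                family = binomial(link = "logit"), data = my_data)

# Estimated probability of receiving the treatment for each subject
my_data$pscore <- predict(ps_model, type = "response")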

More recently, approaches like Covariate Balancing Propensity Scores (CBPS) have gained traction. Unlike traditional methods that separate prediction from balance, CBPS directly integrates both objectives, simultaneously optimizing model fit and covariate balance. This makes it particularly useful when there is concern about model misspecification or limited sample size25.

In line with these balance-focused approaches, recent work by Li et al. has demonstrated that enforcing balance directly through estimation methods—such as CBPS and entropy balancing—can lead to more accurate and less biased treatment effect estimates, particularly when traditional models are misspecified. Their simulation studies showed improvements in bias, variance, and mean squared error, supporting the utility of balance-focused methods26. Similarly, Huan et al. proposed a flexible weighting strategy that chooses between global and local scores based on balance quality. This method has proven effective for multi-site survival analyses and performs comparably to individual-level data pooling27.

Matching

After estimating propensity scores, researchers must choose how to match treated and untreated individuals. The decision should be informed by study goals, sample size, and the distribution of the propensity scores. In the practical exercise, we used 1:1 nearest neighbor matching, which is straightforward and maintains interpretability. Nonetheless, other strategies may be more efficient. For instance, one-to-many (K:1) matching allows each treated subject to be matched with multiple controls, improving precision and reducing standard errors. Evidence from simulations and applied studies suggests that variable-ratio matching often results in lower mean squared error with only a small trade-off in bias28.

Each method brings its own strengths and limitations. Nearest neighbor matching is easy to apply but may yield poor matches in the absence of strong overlap. Caliper matching mitigates this by setting a maximum acceptable difference in propensity scores between matched individuals, though it may reduce the number of matches. Meanwhile, optimal matching seeks to minimize the overall distance across all matched pairs but can be more computationally intensive. In all cases, it is critical to evaluate post-matching balance and justify the selected strategy in light of the dataset’s characteristics11,29.
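The MatchIt calls below sketch how these strategies are specified in R; the formula and data frame (my_data) are placeholders, and method = "optimal" additionally requires the optmatch package.

library(MatchIt)

# 1:1 nearest neighbor matching on the propensity score
m_nn <- matchit(treatment ~ age + sex + dialysis_time, data = my_data,
                method = "nearest", ratio = 1)

# Nearest neighbor with a caliper of 0.2 standard deviations of the propensity score
m_cal <- matchit(treatment ~ age + sex + dialysis_time, data = my_data,
                 method = "nearest", caliper = 0.2)

# K:1 matching (here, 2 controls per treated subject)
m_k1 <- matchit(treatment ~ age + sex + dialysis_time, data = my_data,
                method = "nearest", ratio = 2)

# Optimal pair matching (minimizes the total within-pair distance)
m_opt <- matchit(treatment ~ age + sex + dialysis_time, data = my_data,
                 method = "optimal")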

Assessing balance

Once matching is completed, it is essential to assess whether the groups are now comparable in terms of baseline covariates. The most widely recommended metric is the standardized mean difference (SMD), which is not affected by sample size and provides a directional indication of imbalance. An SMD below 0.1 in absolute value is generally considered acceptable30. Complementary to numerical metrics, visual diagnostics—such as histograms or box plots—can help detect residual imbalances and verify that the matching process worked as intended22. This step is fundamental for establishing the internal validity of the treatment effect estimate11.

Sensitivity analysis

Even when good balance is achieved, it is important to assess whether the results are robust to reasonable changes in the matching procedure. Sensitivity analyses may involve varying the caliper width (e.g., 0.05, 0.1, 0.15) or trying alternative matching methods such as optimal or full matching. By comparing the treatment effect estimates across different configurations, researchers can determine whether findings are consistent or dependent on specific modeling choices23.

Estimating treatment effects

With covariates balanced and matching complete, the treatment effect can be estimated. This is typically done by regressing the outcome on the treatment variable in the matched sample. Because the matching process already adjusted for confounding, it is not necessary to reintroduce covariates into the regression model. This approach ensures that each treated subject is compared only to a comparable control and allows for clean interpretation of the treatment effect12.

Practical considerations and common pitfalls

While PSM offers a robust framework for addressing confounding, several practical aspects must be considered to ensure its validity.

First, although selecting appropriate covariates has been addressed earlier, it is worth emphasizing that omitting key confounders or including irrelevant variables can still compromise results—either by introducing residual bias or by increasing variance, particularly in small samples11,29.

Second, adequate overlap in propensity scores between treatment groups is essential. A lack of common support can hinder valid comparisons and may require trimming unmatched individuals, potentially reducing statistical power22. Visualizing score distributions can help diagnose this issue early in the process.

Third, match quality should not be assumed. Post-matching balance diagnostics, such as SMD, remain essential to confirm that the procedure was successful and that the matched groups are truly comparable22,25.

Finally, it is important to recognize that matching inherently reduces the available sample size. In settings with limited data, this may threaten power and precision. When excessive data loss occurs, alternative approaches—such as inverse probability weighting or covariate adjustment—may offer a more efficient solution23.

Practical example of PSM in kidney transplant research

In this section, we conduct a practical exercise in which PSM is applied to a publicly available dataset focused on kidney transplants. In this example, diabetes status is treated as the exposure variable, and the propensity score is estimated as the probability of having diabetes given relevant covariates (such as age, dialysis time, sex, blood type, and subregion). Matching is performed to balance these covariates between patients with and without diabetes, enabling a comparison of transplant outcomes between these groups.

This example has been intentionally simplified into a binary outcome analysis (transplanted or not) to illustrate the steps involved in implementing PSM. It does not account for the probability of transplantation as a time-to-event process with competing risks (e.g., death or permanent waitlist removal), and continuous variables are dichotomized while certain confounders are excluded to enhance didactic clarity.

The objective is to perform a PSM analysis to examine the impact of a diabetes diagnosis on the likelihood of receiving a transplant among patients on a waitlist. This practical example is conducted in RStudio via R version 4.3.331. By following this example, we aim to demonstrate the step-by-step implementation of PSM, emphasizing the importance of meticulous covariate selection and methodological rigor in observational studies. This exercise provides a detailed and systematic approach to applying PSM, ensuring the robustness and validity of the findings. Furthermore, this example serves as a foundational guide for the application of this methodology, thereby enhancing the overall quality and reliability of research in this field.

Dataset description

The dataset used here is sourced from Kaggle, titled “Waitlist Kidney Brazil”32. It includes patient demographics, clinical factors, and treatment details such as age, time on dialysis, race, sex, underlying disease, diabetes status, blood type, subregion, and transplant outcomes.

Running the analysis directly in R

To execute the full PSM analysis directly, simply copy and paste the script shown in Fig. 1 into RStudio. This script automatically downloads the complete R Markdown file containing the analysis and the kidney transplant waitlist dataset from GitHub and runs the workflow in a fully reproducible way. Both the R Markdown file and dataset, while hosted on GitHub, are also permanently archived and accessible via Zenodo33. The output appears as an HTML report within your R session, without needing to save files manually. This HTML report is also included as Supplementary File 1.

Fig. 1

One-click, fully reproducible PSM analysis script. Installs required packages, downloads the R Markdown from GitHub, renders to HTML, and opens the report; figures are generated non-interactively.

Step-by-Step implementation of the PSM analysis

Before starting the main analysis, an automatic script prepares all necessary elements to ensure reproducibility. This script installs and loads the required R packages (MatchIt, dplyr, readr, cobalt, ggplot2, and gtsummary) from a stable CRAN mirror, downloads the kidney transplant waitlist dataset from GitHub, standardizes column names, corrects character encoding, and recodes the outcome variable (Transplant_Y_N) as a binary indicator (1 = Yes, 0 = No). It also recategorizes key variables such as age and dialysis duration into clinically meaningful groups and removes incomplete cases, generating a clean dataset (data_filtered) ready for matching. Although dichotomizing continuous variables may introduce limitations, such as masking residual imbalance across the full range of values, this step was implemented strictly for didactic purposes to streamline the example and facilitate replication. It is not intended as a methodological recommendation. The full setup code is included in the R Markdown file, although it is not displayed in the HTML output for clarity.

Step 1: sample size before matching

Initially, the analytic dataset size is assessed using the nrow() function to count complete cases. The baseline sample includes 46,817 observations, providing a reference point for evaluating the impact of subsequent matching on sample size and data retention. In addition, Table 1 summarizes baseline characteristics overall and by diabetes status.

Table 1 Baseline characteristics before matching, by diabetes status. Values are n (%) by column.

Step 2: unadjusted association between exposure and outcome

Prior to matching, a logistic regression (glm) estimates the crude association between the exposure (e.g., diabetes) and outcome (receiving a transplant). In this unadjusted model, the exposure coefficient reflects the log-odds of the outcome, providing an initial baseline. A statistically significant result here (log-odds = 0.41, p < 0.001) indicates a strong, yet potentially confounded relationship between diabetes and transplant probability.
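A minimal sketch of Steps 1 and 2, assuming the cleaned data frame is named data_filtered (as in the setup script) and that the exposure and outcome columns are named Diabetes and Transplant_Y_N (the outcome name follows the setup description; the exposure name is an assumption):

# Step 1: analytic sample size before matching
nrow(data_filtered)

# Step 2: crude (unadjusted) association between diabetes and transplantation
crude_model <- glm(Transplant_Y_N ~ Diabetes, family = binomial, data = data_filtered)
summary(crude_model)  # the Diabetes coefficient is the unadjusted log-odds reported above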

Step 3: propensity score matching procedure

To minimize confounding and enhance the comparability between individuals with and without diabetes, we applied PSM using the matchit() function from the MatchIt package. The approach selected was nearest neighbor matching, supplemented with exact matching on key categorical variables—race, blood type, and subregion—to ensure that pairs were only formed within identical strata. Additionally, a caliper width of 0.2 standard deviations was imposed, restricting matches to those with closely aligned propensity scores. This rigorous combination of methods maximizes the quality and comparability of the matched pairs, substantially reducing potential systematic bias. Inevitably, some cases remain unmatched due to these strict criteria; however, this is an expected and acceptable trade-off, as it prioritizes analytic rigor and matching quality over sheer sample size. Because sex showed residual imbalance in the baseline match (see Steps 5–6), we also examined tighter calipers (0.05–0.15) and an exact-on-sex specification in sensitivity analyses (Steps 8–9).
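The matching call described above can be sketched as follows; the covariate column names are assumptions and may differ slightly from those in the dataset.

library(MatchIt)

m.out <- matchit(Diabetes ~ Age_group + Dialysis_time_group + Sex + Race +
                   Blood_type + Subregion,
                 data = data_filtered,
                 method = "nearest",                       # nearest neighbor matching
                 distance = "glm",                         # propensity score from logistic regression
                 exact = ~ Race + Blood_type + Subregion,  # pairs formed only within identical strata
                 caliper = 0.2)                            # 0.2 SD caliper on the propensity score

matched_data <- match.data(m.out)  # matched cohort used in the steps that follow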

Step 4: sample size after matching

Following the matching process, we used the nrow() function to determine the number of observations retained. The final matched cohort consisted of 19,504 individuals, reflecting a considered trade-off between preserving statistical power and enhancing covariate balance through stringent matching criteria. Such a reduction in sample size is expected in rigorous propensity score analyses, where the primary focus is on ensuring high-quality matches to reduce potential bias.

Step 5: assessing covariate balance numerically

To determine whether the matching procedure effectively balanced the main covariates, we calculated SMD for each variable before and after matching using the bal.tab() function from the cobalt package. The most relevant SMD results are summarized in Table 2. After matching, nearly all covariates exhibited SMD values close to zero, reflecting successful reduction of imbalance between the diabetes and non-diabetes groups. The only exception was a minor residual imbalance in the “Male” variable. Specifically, the post-match SMD for sex was 0.156 (above the 0.10 threshold), which motivated the sensitivity checks reported in Steps 8–9.

Table 2 Standardized mean differences before and after propensity score matching for main covariates. SMD = standardized mean difference.
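Assuming the matchit object from Step 3 is named m.out, the SMDs summarized in Table 2 can be obtained with a call along these lines:

library(cobalt)

# Standardized mean differences before (un = TRUE) and after matching,
# flagged against the 0.1 balance threshold
bal.tab(m.out, un = TRUE, stats = "mean.diffs", thresholds = c(m = 0.1))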

Step 6: visualizing covariate balance with a love plot

To complement the numeric summary, we generated a Love plot using the love.plot() function from the cobalt package, which offers a visual representation of the SMDs reported in Table 2 (Fig. 2). In this plot, each covariate appears on the y-axis and SMDs are shown on the x-axis. Red dots represent SMDs before matching, and blue dots indicate SMDs after matching. The vertical dashed line at 0.1 denotes the threshold for acceptable covariate balance. As shown in Fig. 2, after matching, SMDs for all main covariates—except for a slight imbalance in the “Male” variable—fall well below the threshold, visually confirming the success of the matching procedure. Consistent with Step 5, sex remains slightly above the 0.10 line in the baseline match.

Fig. 2

Love plot of standardized mean differences for key covariates before (red) and after (blue) matching. The vertical dashed line at 0.10 marks the balance threshold.
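A sketch of the love.plot() call that produces a figure like Fig. 2, again assuming the matchit object m.out:

library(cobalt)

love.plot(m.out, stats = "mean.diffs", abs = TRUE,
          thresholds = c(m = 0.1),    # dashed line at the 0.1 balance threshold
          var.order = "unadjusted")   # order covariates by pre-matching imbalance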

Step 7: assessing propensity score overlap and distribution

To evaluate whether the PSM procedure achieved adequate overlap and comparability between groups, we examined the distribution of propensity scores using two complementary visual diagnostics generated with the plot() function from the MatchIt package. First, the jitter plot (Fig. 3) displays the distribution of individual propensity scores among treated (diabetes) and control (non-diabetes) units before and after matching. This plot shows that, after matching, most treated and control observations lie within a common range of propensity scores, supporting the validity of comparisons within the matched sample. Second, the set of histograms (Fig. 4) shows the proportion of subjects at each propensity score interval for both groups, before and after matching. The close alignment of these distributions in the matched samples further demonstrates that the matching process produced analytic groups with similar baseline characteristics. Collectively, these visualizations confirm that the matched dataset achieves substantial overlap in propensity scores, which is critical for unbiased estimation of treatment effects in subsequent analyses.

Fig. 3

Jitter plot of propensity-score distributions for treated and control groups before and after matching.

Fig. 4

Histograms of propensity scores for treated and control groups before (left) and after (right) matching. Greater alignment post-matching indicates improved comparability.
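The diagnostics in Figs. 3 and 4 come from the plot() method for matchit objects; a sketch, assuming m.out from Step 3:

# Jitter plot of individual propensity scores (Fig. 3)
plot(m.out, type = "jitter", interactive = FALSE)

# Histograms of propensity scores before and after matching (Fig. 4)
plot(m.out, type = "histogram")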

Step 8: sensitivity analysis by caliper width

To test the robustness of findings, a sensitivity analysis is conducted with different caliper widths (0.05, 0.10, 0.15) in the matching process. For each caliper, logistic regression estimates the diabetes effect on transplant likelihood. Results are consistently positive and statistically significant (caliper 0.05: 0.17; caliper 0.10: 0.25; caliper 0.15: 0.27; all p < 0.001), indicating stable results despite variations in matching strictness. This stability reinforces confidence in the robustness of the primary findings. Importantly, sex balance improved to SMD < 0.10 at calipers 0.05 (− 0.067) and 0.10 (− 0.078), but not at 0.15 (− 0.142).
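A sketch of the caliper sensitivity loop, reusing the specification from Step 3 (the covariate names remain assumptions):

library(MatchIt)

for (cal in c(0.05, 0.10, 0.15)) {
  m_sens <- matchit(Diabetes ~ Age_group + Dialysis_time_group + Sex + Race +
                      Blood_type + Subregion,
                    data = data_filtered, method = "nearest",
                    exact = ~ Race + Blood_type + Subregion, caliper = cal)
  fit <- glm(Transplant_Y_N ~ Diabetes, family = binomial, data = match.data(m_sens))
  cat("Caliper", cal, "- diabetes log-odds:", round(coef(fit)[2], 3), "\n")
}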

Step 9: sensitivity analysis by matching method

To check robustness, three nearest-neighbor variants were run at a caliper of 0.10: without replacement, with replacement, and exact-on-sex. In every case, balance on sex improved to an absolute SMD < 0.10. Notably, the exact-on-sex variant achieved perfect balance on sex (SMD = 0.000) while keeping a matched sample size (N) similar to the no-replacement design, and it produced a diabetes coefficient of ≈ 0.216 (standard error [SE] ≈ 0.034; p < 1 × 10⁻⁹).
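The three variants can be sketched as follows (same assumed column names as before):

library(MatchIt)
library(cobalt)

psm_formula <- Diabetes ~ Age_group + Dialysis_time_group + Sex + Race +
  Blood_type + Subregion

# (a) Nearest neighbor without replacement (the default)
m_norep <- matchit(psm_formula, data = data_filtered, method = "nearest",
                   caliper = 0.10, exact = ~ Race + Blood_type + Subregion)

# (b) Nearest neighbor with replacement
m_rep <- matchit(psm_formula, data = data_filtered, method = "nearest",
                 caliper = 0.10, replace = TRUE,
                 exact = ~ Race + Blood_type + Subregion)

# (c) Exact matching on sex in addition to the other exact strata
m_exsex <- matchit(psm_formula, data = data_filtered, method = "nearest",
                   caliper = 0.10,
                   exact = ~ Sex + Race + Blood_type + Subregion)

bal.tab(m_exsex, stats = "mean.diffs")  # sex is balanced by construction (SMD = 0)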

Step 10: adjusted association after matching

Finally, logistic regression analysis (glm) was fitted to the matched dataset to estimate the adjusted association between diabetes and the probability of transplant. For inference, dependence within matched sets was addressed using cluster-robust (“sandwich”) standard errors (SEs) at the matched-set level (implemented with the sandwich and lmtest R packages). This is comparable to generalized estimating equations (GEE) with an exchangeable working correlation and is recommended for matched observational data34,35. Under the final specification (nearest-neighbor matching, caliper 0.10, exact-on-sex), the diabetes coefficient was ≈ 0.216 with a cluster-robust SE ≈ 0.033 (z ≈ 6.49, p ≈ 8.45 × 10⁻¹¹), which corresponds to an odds ratio (OR) of ~ 1.24. Therefore, compared with the unadjusted analysis, this smaller but still significant effect supports effective confounding control through PSM. If any covariate had remained ≥ 0.10 in SMD after matching, a doubly robust outcome model adjusting for that covariate would have been added, still using cluster-robust SEs.
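A sketch of the outcome model with cluster-robust standard errors, assuming the final matchit object is m_exsex (the exact-on-sex specification) and that matched sets are identified by the subclass column created by match.data():

library(sandwich)
library(lmtest)

final_data <- match.data(m_exsex)  # adds 'weights' and 'subclass' columns

# In 1:1 matching without replacement all matching weights equal 1;
# in other designs, add weights = weights to the call below
adj_model <- glm(Transplant_Y_N ~ Diabetes, family = binomial, data = final_data)

# Cluster-robust (sandwich) standard errors at the matched-set level
coeftest(adj_model, vcov. = vcovCL, cluster = ~ subclass)

exp(coef(adj_model)[2])  # odds ratio for diabetes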

Reporting guidelines for PSM studies

Essential elements to report

Covariates included and rationale

It is essential to list all covariates included in the propensity score model and provide a rationale for their inclusion. Covariates should be selected on the basis of theoretical considerations, prior empirical research, or both, ensuring that they are related to both the treatment and the outcome. This comprehensive selection process helps to adequately control for confounding factors36,37.

Matching algorithm and parameters

Describe the matching algorithm used, such as nearest neighbor, caliper matching, or Mahalanobis distance matching. Additionally, specify the parameters used, such as the caliper width or matching ratio. For example, nearest neighbor matching might use a 1:1 or 1:2 matching ratio, and caliper matching might specify a caliper width of 0.1 standard deviations of the logit of the propensity score29,37,38.

Sample sizes before and after matching

Report the sample sizes of the treatment and control groups both before and after matching. This information is crucial for understanding the extent of data reduction due to matching and the potential impact on statistical power29,37.

Methods and results of balance assessment

Outline the methods used to assess the balance of covariates between the treatment and control groups after matching. This typically involves SMDs, variance ratios, or graphical methods such as histograms and jitter plots. Reporting the results of these assessments demonstrates the effectiveness of the matching process in achieving covariate balance29,36,37,38.

Sensitivity analysis

Conduct a sensitivity analysis to check the robustness of the matching results, for example by varying the caliper width or applying different matching methods, and verify that the conclusions are consistent across specifications. Report whether the estimated treatment effects remain stable under these alternative matching parameters and methods.

Statistical analysis of treatment effects

Detail the statistical methods used to estimate treatment effects after matching. This might include regression adjustment, difference-in-differences analysis, or instrumental variable approaches. Present the results of these analyses, including estimates of treatment effects, confidence intervals, and significance levels38.

Best practices for transparency and reproducibility

Detailed methodological description

Provide a comprehensive description of all steps in the PSM process, including data preprocessing, propensity score estimation, matching procedures, and balance assessment. This ensures that other researchers can fully understand and replicate the methodology.

Code and data sharing

Share the code used for propensity score estimation and matching, preferably with annotations to explain each step. Where possible, make the dataset or a synthetic version available to allow others to replicate the analysis. This practice enhances transparency and facilitates validation by other researchers.

Documentation of assumptions

Clearly document all assumptions made during the analysis, such as the ignorability assumption, and discuss their potential impact on the study’s conclusions. This transparency allows for a better understanding of the limitations and strengths of the study.

Thorough reporting of results

Include detailed tables and figures that show the balance of covariates before and after matching, as well as the estimated treatment effects. Provide supplementary materials if necessary to keep the main text concise. This thorough reporting ensures that the results are clear and interpretable.

Ethical considerations

Discuss any ethical considerations related to the data and analysis, including issues of consent, confidentiality, and potential biases introduced by the matching process. Addressing these issues is crucial for maintaining ethical standards and the integrity of research. It is also recommended to follow established guidelines, such as the STROBE guidelines, to further ensure clarity, transparency, and methodological rigor in reporting observational studies1.

By following these guidelines, researchers can enhance the transparency, reproducibility, and credibility of PSM studies. Ultimately, this contributes to the robustness of causal inference in observational research.

Complementary causal inference approaches

In addition to PSM, it is advisable to use complementary methods to address limitations inherent to matching alone. One valuable tool in this regard is the use of DAGs, which help make causal assumptions explicit and guide the selection of appropriate covariates for adjustment, reducing the risk of including mediators or colliders that could bias estimates39,40. Furthermore, an approach such as inverse probability of treatment weighting (IPTW) is an alternative to matching that can help preserve sample sizes and hence power; in particular, marginal structural models (MSMs) are well-suited when exposures and confounders vary over time, as is often the case in kidney transplant research41,42. For example, recent studies have demonstrated how IPTW can emulate a target trial comparing transplantation with long-term dialysis, providing more robust effect estimates than matching alone43.
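As a minimal base-R sketch of IPTW with stabilized weights (variable names are hypothetical; packages such as WeightIt can streamline this in practice):

# Propensity model and estimated scores
ps_fit <- glm(treatment ~ age + sex + dialysis_time, family = binomial, data = my_data)
ps <- predict(ps_fit, type = "response")

# Stabilized inverse probability of treatment weights
p_treat <- mean(my_data$treatment)
w <- ifelse(my_data$treatment == 1, p_treat / ps, (1 - p_treat) / (1 - ps))

# Weighted outcome model; quasibinomial avoids warnings with non-integer weights,
# and robust (sandwich) variance estimation is advisable for inference
iptw_fit <- glm(outcome ~ treatment, family = quasibinomial, data = my_data, weights = w)
summary(iptw_fit)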

Additionally, doubly robust estimators, which combine outcome regression with propensity-based weighting, offer an extra safeguard by providing valid causal estimates if either the propensity score model or the outcome model is correctly specified44,45. This dual protection makes doubly robust methods a valuable complement to both PSM and IPTW, particularly in observational contexts where some model misspecification is likely. By combining DAGs, IPTW, MSMs, and doubly robust estimation alongside PSM, researchers can strengthen the validity, transparency, and interpretability of causal inferences in kidney transplantation studies.

Conclusion

In conclusion, PSM plays an important role in kidney transplant research by providing a structured approach to control for confounding and strengthen the validity of findings from observational data. It improves comparability between treated and control groups on key baseline variables, allowing researchers to estimate treatment effects more reliably. Nonetheless, its effectiveness depends on appropriate covariate selection, consistent data quality, and careful consideration of potential sample size reductions, which may influence statistical power. To address these challenges, combining PSM with complementary methods such as DAGs, IPTW, and MSMs can help account for complex causal structures and time-varying confounding. Maintaining rigorous methodology, clearly reporting each analytical step, sharing code and assumptions, and conducting sensitivity analyses are essential for ensuring transparent and reproducible results. By applying these principles and adhering to established reporting standards such as STROBE, researchers can contribute to more robust and informative evidence that supports decision-making in kidney transplantation.