Abstract
Endoscopic interventions are essential for diagnosing and treating gastrointestinal conditions. Accurate and comprehensive documentation is crucial for enhancing patient safety and optimizing clinical outcomes; however, adverse events remain underreported. This study evaluates a machine learning-based approach for systematically detecting endoscopic adverse events from real-world clinical metadata, including structured hospital data such as ICD codes and procedure timings. Using a random forest classifier to detect the adverse events perforation, bleeding, and readmission, we analyzed 2490 inpatient cases, achieving substantial improvements over baseline prediction accuracy. The model achieved AUC-ROC/AUC-PR values of 0.90/0.69 for perforation, 0.84/0.64 for bleeding, and 0.96/0.90 for readmission. The results highlight the importance of combining multiple metadata features for robust predictions. This semi-automated method offers a privacy-preserving tool for identifying documentation discrepancies and enhancing quality control. By integrating metadata analysis, the approach supports better clinical decision-making, quality improvement initiatives, and resource allocation while reducing the risk of missed adverse events in endoscopy.
Introduction
Endoscopic interventions have become critical tools in diagnosing and treating various gastrointestinal (GI) diseases, offering minimally invasive solutions with reduced patient recovery time and morbidity. However, as with any medical procedure, these interventions carry inherent risks of adverse events, which may range from minor to life-threatening1. Despite advancements in technology and technique, the occurrence of adverse events remains a significant concern, necessitating a thorough understanding of their nature, frequency, and contributing factors.
Accurate and consistent documentation of adverse events is essential to enhancing the overall quality and safety of endoscopic procedures. Not only does comprehensive documentation provide valuable data for individual patient care and informed clinical decision-making, but it also facilitates broader quality improvement efforts2. By systematically recording adverse events, healthcare providers can identify patterns, benchmark performance, and implement targeted interventions to reduce risk. Additionally, such data is critical for developing evidence-based guidelines, conducting risk assessments, and improving the training of endoscopists.
Adverse events at the University Hospital Mannheim (see Supplementary Table 1) and in the German screening colonoscopy registry are underreported3. Systematic recording of adverse events is challenging because they can arise at any point in a patient’s medical history. While such recordings may work reasonably well for adverse events identified during the endoscopy procedure itself, they fail to capture those that emerge later, either during the subsequent hospital stay or upon readmission. In some instances, readmission may occur at a different hospital, further complicating documentation. Consequently, these adverse events may remain undocumented in the systematic recording system, even though they might be noted in other written records, such as discharge letters.
In medicine, machine learning has been widely applied to computer vision4,5, one prominent example being applications within endoscopy6. It has also been used to predict the onset of diseases7,8,9,10,11 from clinical data. Large language models (LLM), a subset of machine learning, have demonstrated their utility in systematically extracting information from unstructured electronic health records12,13,14. LLMs have also been utilized to extract information from the Mannheim colonoscopy dataset15 used in this work.
In this paper, we investigate the potential of detecting adverse events from metadata, that is, structured data such as hospital stay duration, material used during endoscopy, ICD codes, and other related information (see Fig. 1 for examples and Supplementary Tables 2–3 for full list)16. Crucially, our objective is to detect adverse events that have already taken place, not to predict their occurrence in advance. Detecting adverse events from metadata could enhance the accuracy of adverse event records and help identify discrepancies between written reports and their corresponding metadata. Specifically, we aim to investigate whether certain adverse events leave identifiable signatures within the metadata that can be extracted using machine learning.
This figure displays an example of data generated during a hospital stay, which includes both unstructured data, primarily in the form of text (e.g., endoscopy reports and discharge letters), and structured data (metadata), such as diagnoses, materials used during endoscopy, and time until discharge. For a comprehensive list of the metadata used, refer to Supplementary Tables 2–3.
Results
We focus on three key adverse events associated with endoscopic mucosal resection (EMR): bleeding, perforation, and readmission within 30 days due to EMR-related issues (see supplementary material for the consensus definitions). If bleeding or perforation became apparent during or after readmission, it was classified as readmission rather than as bleeding or perforation.
We trained a machine learning algorithm, more specifically a random forest classifier, on the metadata (see Supplementary Tables 2–3 for features used). To test our method, we utilized 2490 cases of inpatient stays including at least one endoscopic procedure with endoscopic mucosal intervention performed between 2010 and 2022 at the University Hospital Mannheim. A general characterization of the cohort is provided in Table 1. Cohort characteristics of the training dataset are shown in Table 2, and those of the test dataset are presented in Table 3. Additional details of the cohort characteristics can be found in Supplementary Tables 4–7. No restrictions were applied regarding the reason for hospitalization. The most common diagnoses were polyps and benign neoplasms of the colon, cecum, or rectum.
All 2490 cases were utilized for classifying the adverse event types of bleeding and perforation. For readmission as an adverse event, 213 cases were analyzed; these represent instances where patients were readmitted within 30 days post-discharge. The study design for the adverse events perforation and bleeding is illustrated in Fig. 2, while the scheme for readmission as an adverse event is presented in Fig. 3.
For the adverse events bleeding and perforation, the scheme for training and testing is displayed. For this purpose, a combination of LLM-generated and manually generated labels was used. The random forest was trained for the two adverse event types, perforation and bleeding, on a training set of n = 1990 cases. The labels for the training set were obtained by running a large language model on the endoscopy reports and discharge letters. The performance metrics were obtained by testing on the remaining n = 500 manually labeled cases, which represent the ground truth. To estimate the stability of the machine learning algorithm, the large language model labels were used for the entire dataset (n = 2490). With these, we performed random subsampling with 100 iterations. In each iteration, the data was randomly split into a training set (n = 1990) and a test set (n = 500). From this, the standard deviation of the performance metrics was calculated. Perforation or bleeding that occurred after readmission was not classified as the adverse event perforation or bleeding, but rather as the adverse event readmission. The listed data is available at discharge, allowing the detection of adverse events such as bleeding or perforation to be performed at discharge or any later time.
Training and testing scheme for adverse event readmission within 30 days due to adverse events in connection with previously performed EMR. The entire data set, n = 213, consisting of all readmissions within 30 days was manually labeled. Given the limited sample size, the metadata used was restricted to the time until readmission and the ICD codes recorded at readmission. The random forest classifier was trained on n = 163 cases and tested on n = 50 cases. To evaluate the stability of the machine learning algorithm, random subsampling was performed over 100 iterations, with different splits between training and testing sets in each iteration. The listed data is available at readmission, allowing the detection of adverse event readmission to be performed at readmission or any later time.
For the supervised machine learning algorithm, labels were obtained from the written reports (endoscopy reports, discharge notes, readmission notes). These were either reviewed by an expert (referred to as “manually generated labels”) or generated using a large language model (LLM), referred to as “LLM-generated labels.” The manually generated labels were treated as the ground truth, meaning they were assumed to be fundamentally correct. Among the 500 manually labeled cases used for testing the adverse events bleeding and perforation, 134 cases involved bleeding and 37 involved perforation. Of the 213 manually labeled cases available for the readmission analysis, 45 were identified as adverse event readmissions.
Model achieves high classification performance for readmission
The results were evaluated using receiver operating characteristic (ROC) and precision-recall (PR) curves. The primary performance metrics were the areas under these curves (AUC). Due to class imbalance, the precision-recall AUC (AUC-PR) was prioritized as the main performance metric over the ROC AUC (AUC-ROC)17. The AUC-PR performance of our machine learning algorithm was compared against a baseline dummy classifier that performs random classification.
The results of our analysis are presented in Figs. 4 and 5. The AUC-PR was calculated to evaluate the accurate classification of adverse events, yielding values of 0.69 for perforation (compared to a dummy classifier’s 0.07), 0.64 for bleeding (dummy classifier: 0.27), and 0.9 for readmission (dummy classifier: 0.21). The corresponding AUC-ROC values are 0.9 for perforation, 0.84 for bleeding, and 0.96 for readmissions. The confusion matrices and individual AUC-ROC/-PR curves are provided in Supplementary Figs. 1–6. For comparison, two gradient-boosted decision tree algorithms (LightGBM and CatBoost) and a deep neural network (TabNet) were applied to the same dataset. CatBoost performed on par with the random forest, achieving AUC-ROC scores of 0.90 for perforation, 0.85 for bleeding, and 0.95 for readmission, as well as AUC-PR scores of 0.71 for perforation, 0.65 for bleeding, and 0.89 for readmission. For more details, see Supplementary Fig. 7.
Test results (AUC-ROC and AUC-PR) and errors for adverse event readmission within 30 days due to adverse events in connection with previously performed EMR are displayed. The dataset (n = 213) with manually labeled data was randomly split into a training set (n = 163) and a testing set (n = 50). This random subsampling process was repeated 100 times. The AUC-ROC and AUC-PR values were calculated as the mean across all runs, with error bars representing the standard deviation.
a The test results for adverse events bleeding and perforation (AUC-ROC and AUC-PR) are displayed. The model was trained on a training set (n = 1990) with labels generated by a large language model and tested on a manually labeled test set (n = 500). Direct error bars cannot be computed for this process, as random subsampling would require manual labels for all cases. b Estimated error values using only labels generated by a large language model are shown. Labels generated by a large language model are used for both training (n = 1990) and testing (n = 500). This process is repeated over 100 iterations using random subsampling, with a different split of training and test data in each iteration. Performance metrics (AUC-ROC, AUC-PR, and dummy classifier) are calculated as mean values, with the error bars representing the standard deviations shown in the plot.
The ROC and PR curves shown in Supplementary Figs. 4–6 show that perforation and readmission achieve a perfect positive predictive value (PPV) of 1 at small sensitivity (true positive rate) values. In contrast, bleeding never reaches a PPV of 1, even at low sensitivity values.
Cross-validation demonstrates stable performance
To assess the stability of the random forest classifier, we employed random subsampling as a cross-validation technique18. The error bars, which reflect the variability of the performance estimates, are shown in Fig. 4 for adverse event readmission and in Fig. 5b for adverse events bleeding and perforation. Note that, for perforation and bleeding, random subsampling could not be performed on the actual test data, as only 500 manually labeled cases were available. Instead, this analysis was conducted with labels generated by a large language model (2490 cases in total) for the entire dataset. The AUC-ROC and AUC-PR values in Fig. 5b are approximately within 10% of those in Fig. 5a, where manually labeled cases were used as test data. The error bars here serve as an estimation of the variability that would exist if all labels were manually generated.
As an additional validation step (see Supplementary Table 8), we analyzed the subset of 500 samples for which both manual and LLM-generated labels were available, performing 1000 bootstrapping iterations to assess model performance for adverse events perforation and bleeding. The results showed similar AUC-ROC and AUC-PR values for both label types, potentially indicating comparable model performance when using either LLM-generated or manual labels.
Metadata deviations from regular care plans strongly indicate adverse events
For perforation, the top three key features were “Charlson comorbidity index”, “OPS code 5-469.D3 (endoscopic clipping)”, and “hemostasis clipping 235 cm”. The most important features, as determined by SHAP19, are displayed in Fig. 6. For bleeding, the top three features were “OPS code 5-493.D3 (endoscopic clipping)”, “Charlson comorbidity index”, and “hemostasis clipping 155 cm”. In both cases, the Charlson comorbidity index and the use of clips were associated with adverse events. Such deviations from the normal care plan suggest that an adverse event may have occurred. Interestingly, while the ICD code K63.1 (perforation) might seem like a strong indicator of perforation, only 8 of the 37 test-set cases with this code were actually related to EMR-induced perforations, so additional features are required to achieve robust classification performance.
For readmission, the top three predictors were “ICD code K92.2 (gastrointestinal bleeding)”, “discharge-to-readmission time”, and “ICD code T81.0 (bleeding and hematoma as a complication of a procedure)”. This suggests that rapid readmission accompanied by bleeding, whether related to a procedure or occurring in the gastrointestinal tract, may indicate an EMR-related adverse event.
Sequential feature selection reveals performance declines with reduced feature sets
To evaluate whether a reduced number of features could achieve similar performance, experiments were conducted using only the top one, two, or three features for each type of adverse event. For example, it was hypothesized that ICD code K63.1 (perforation) might alone provide superior predictive power for perforation. However, the results showed that reducing the number of variables caused noticeable declines in performance, as measured by AUC-ROC and AUC-PR. The decline in performance was particularly pronounced for perforation and bleeding, while for readmission it was less pronounced. The detailed results are shown in Supplementary Fig. 8.
Discussion and conclusion
This work demonstrates the potential of leveraging metadata to improve the systematic documentation of adverse events associated with endoscopic interventions. By applying machine learning methods, specifically a random forest classifier, to hospital metadata, we identified patterns associated with specific adverse events (perforation, bleeding, and readmission), which offers the potential to automate and enhance the accuracy of adverse event records. Our machine learning algorithm demonstrated a substantial improvement over the baseline.
A limitation of this study is that the model was developed and evaluated using data from a single hospital. While we deliberately selected generic features that should generalize well, external validation with data from other institutions would be necessary to confirm the robustness and generalizability of our approach to different settings. Patient populations and the types of interventions performed may vary depending on the level and type of care, for example, between a tertiary care institution such as the University Medical Center Mannheim, other hospitals or clinics, and outpatient colonoscopy settings. External validation could help ensure that the observed patterns are not specific to our dataset and that the model performs consistently across various healthcare environments. Importantly, generalizability can only be robustly demonstrated when large-scale, representative data from diverse medical institutions are available—an undertaking that is not feasible at the current stage. Nevertheless, our method offers substantial practical value: it can be rapidly implemented across clinical settings with minimal human effort. The low manual overhead ensures that individual centers can adopt the system efficiently, requiring only modest local adaptation. We view this ease of deployment as a central strength and a key contribution of our work. Another limitation of our study is that potential readmissions to other hospitals are not captured in our dataset. However, with the electronic patient record system now being made available in Germany, this limitation may be mitigated in the future.
An important finding of this study is that for adverse events perforation and bleeding, no single feature, or even a small subset of features, can solely explain the model’s predictions. For the adverse event of readmission, three features already yield respectable classification performance, but adding more features further improves performance. The results indicate that the prediction of any adverse event type relies on the complex interplay of multiple features in the metadata. This underscores the need for a comprehensive approach in predictive modeling, where the combination of variables, rather than individual ones, leads to optimal performance.
Our model was particularly successful in detecting adverse events related to perforation and readmission, yielding strong performance metrics. Adverse events of the type bleeding were also detectable, although with somewhat lower predictive accuracy. This may be due to the less distinct and consistent signature of bleeding in the metadata, which complicates precise prediction. Unlike perforation, which is often linked to specific indicators such as increased use of clipping material, longer hospital stays, and even specific ICD codes, bleeding appears to lack such clear markers. This could be because bleeding does not significantly alter the course of therapy in a way that is easily captured in metadata, making it harder to detect or track compared to perforation, which has a more direct and identifiable association with clinical data. Given that perforation and readmission incur higher costs to the health system and can be detected more accurately, the proposed system may be particularly effective in identifying costly and resource-intensive adverse events.
Even when trained on limited, noisy datasets, such as in this study, the model demonstrated its robustness and reliability, evidenced by narrow error bars for the AUC-ROC and AUC-PR. This indicates that the algorithm performs consistently across different subsets of the data and is not overly sensitive to small variations. Overall, these findings suggest that our machine learning approach not only enhances the predictive accuracy for rare adverse events but also maintains stability across various data subsets. This makes it a promising tool for clinical decision-making. The lower accuracy for bleeding, however, highlights an area for further refinement, where additional data, feature engineering, or potentially a narrower definition of bleeding might improve its detection. In the future, incorporating additional data could potentially enhance the accuracy of classifying any of the investigated adverse events.
Furthermore, once the tool has been trained and is deployed, it operates independently of written text. This enables metadata analysis to be conducted in a privacy-preserving manner, avoiding the need to transfer patient-identifying data to cloud servers, an approach that is not always feasible with large language models. Moreover, when trained on high-quality labeled data and deployed in real-world clinical settings, our method has the potential to flag adverse events that may not be explicitly documented in free text but are discernible through patterns in structured data.
Enhancing the integration between metadata analysis and patient records has the potential to significantly improve the accuracy of adverse event documentation while reducing inconsistencies between reported and undocumented events20,21. For the individual patient, lost information about an adverse event could compromise subsequent treatment and care. For improving treatment outcomes now and in the future, a complete and accurate systematic record of all adverse events is essential for learning and refining clinical practices.
Among others, we envision the following practical implementation within a comprehensive clinical support system for the electronic health record: at discharge, our machine learning algorithm could check whether any adverse events have occurred based on the structured metadata. This could then be compared to what is written in the discharge letter. If a discrepancy is detected, an alert could prompt clinicians to verify whether all significant events have been accurately documented. For example, an adverse event may have occurred and been recorded as an ICD diagnosis, but due to handover mistakes, it might not be properly documented in the discharge letter, especially if the responsible physician for the colonoscopy differs from the one handling the discharge. Incorrectly coded ICD diagnoses could potentially be identified if they do not match the written reports.
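To make the envisioned discrepancy check concrete, the following is a minimal, hypothetical sketch: the interface, the letter_mentions_event flag, and the alert threshold are illustrative assumptions, not part of our implementation.

```python
def flag_documentation_gap(case_metadata, letter_mentions_event: bool,
                           model, threshold: float = 0.5) -> bool:
    """Hypothetical discharge-time check: alert when the metadata model
    detects an adverse event that the discharge letter does not mention."""
    # Probability of an adverse event according to the trained classifier.
    p_event = model.predict_proba([case_metadata])[0, 1]
    # Raise an alert only when the model and the letter disagree.
    return p_event >= threshold and not letter_mentions_event
```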
Our study also demonstrated how large language model (LLM)-based text mining could assist in this process by automatically extracting relevant information from unstructured text in clinical documents, such as discharge letters. By leveraging LLMs, the system could identify adverse events and discrepancies between the structured and unstructured data. This capability could be further enhanced to perform automated cross-checking between text-based EHR documents and structured healthcare data, increasing the accuracy and completeness of clinical documentation.
An automated feedback loop could also be envisioned to inform the endoscopist who performed the intervention about the occurrence of an adverse event.
Checking for adverse events could also be conducted at other times, such as at readmission or even during the yearly review of adverse events. Specifically, to improve data quality in the systematic recording of adverse events, this algorithm could flag potential cases retrospectively, identifying occurrences of adverse events that might not have been fully documented.
Challenges for a real-world pilot project include usability and interpretability. So far, our algorithm only classifies the adverse event without providing reasoning. Future work could focus on improving interpretability, for instance, by exploring the explainability of specific instances using SHAP, local interpretable model-agnostic explanations22, or other techniques. Additionally, user feedback should be integrated to enhance the learning process, which could be done by implementing reinforcement learning algorithms.
Methods
Ethical approval has been received from Ethics Committee II, Medical Faculty of Mannheim, Heidelberg University (approval number 2021-694). Pseudonymized data processing in retrospective studies is exempt from obtaining individual patient consent under the applicable regulatory framework.
Labeling process
Labels for the supervised machine learning algorithm were obtained from written reports, including endoscopy reports, discharge notes, and readmission notes. Labels reviewed and assigned by an expert are termed “manually generated labels,” while those produced by a large language model (LLM) are referred to as “LLM-generated labels.” The manually generated labels were considered the ground truth and assumed to be accurate. Specifically, for adverse events bleeding and perforation, the ground truth was established using endoscopy and discharge notes, while adverse events in connection to readmission were identified through readmission notes, in conjunction with previous reports to determine their association with prior endoscopic mucosal resections. In order to obtain reliable performance metrics, testing was only performed on cases with manual labels available.
For readmission, all 213 cases were manually labeled, providing a complete set of high-quality ground truth data. However, for the adverse events bleeding and perforation, only 500 of the 2490 cases were manually labeled and used as the test dataset. The remaining cases, used as training data, were labeled using the large language model Llama-2 70B in a version fine-tuned for German (Llama-2 70B “Sauerkraut”), as described in ref. 15, which provides details of the implementation. A simple prompt asking to extract adverse events was found to work well. The ground truth was established using the definitions provided in the Supplementary material.
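As a minimal sketch of how such label extraction might look, consider the following; the prompt wording and the query_llm helper are illustrative assumptions, not the exact setup of ref. 15.

```python
# Hypothetical label-extraction step; the prompt text and the query_llm
# callable are placeholders, not the implementation described in ref. 15.
PROMPT_TEMPLATE = (
    "Read the following endoscopy report and discharge letter. "
    "Answer only 'yes' or 'no': did an adverse event of type "
    "{event_type} occur?\n\n{report_text}"
)

def extract_label(report_text: str, event_type: str, query_llm) -> int:
    """Map a free-text report to a binary label (1 = adverse event)."""
    answer = query_llm(PROMPT_TEMPLATE.format(event_type=event_type,
                                              report_text=report_text))
    return int(answer.strip().lower().startswith("yes"))
```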
It is important to note that this method of using large language models may introduce slight inaccuracies due to the potential for noisy labels generated by the model. For an analysis of the quality of these LLM-generated labels, see Supplementary Fig. 9.
Data preprocessing and feature engineering
Data processing was performed using pandas23. The data was delivered as multiple Excel files that were combined using pandas. Unstructured data, such as written text, was removed from the dataset. One-hot encoding was used for categorical data. For materials used during endoscopy, the quantity of each material was also encoded: for example, if three clips of type “hemostasis clip 235 cm” were used, the “hemostasis clip 235 cm” category was set to 3.
If appropriate, imputation using the median value was applied; that is, missing values for a case were replaced with the median value of the feature. If no readmission took place, the feature “admission-to-readmission time” was set to 1000 days. In seven cases, the patient underwent two endoscopic interventions including EMR within one hospital stay. In these cases, the DRG and OPS codes, as well as the material used, were combined. The date of the first intervention was used to calculate the feature “procedure-to-readmission time,” while the maximum value of the feature “procedure time” was used. In addition, multiple interventions within one hospital stay were captured by the feature “number of procedures.”
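A minimal sketch of the encoding and imputation steps in pandas follows; the column names and values are invented for illustration (the actual feature set is listed in Supplementary Tables 2–3).

```python
import pandas as pd

# Toy material log: one row per material used in a case (invented columns).
materials = pd.DataFrame({
    "case_id":  [1, 1, 1, 2],
    "material": ["hemostasis clip 235 cm"] * 3 + ["injection needle"],
})

# Quantity-aware one-hot encoding: each material becomes a column holding
# the number of times it was used in the case (three clips -> value 3).
encoded = materials.groupby(["case_id", "material"]).size().unstack(fill_value=0)

# Imputation rules described above: a sentinel of 1000 days when no
# readmission took place, the column median for remaining numeric gaps.
features = pd.DataFrame({
    "admission_to_readmission_time": [12.0, None],
    "procedure_time": [None, 35.0],
}, index=encoded.index)
features["admission_to_readmission_time"] = (
    features["admission_to_readmission_time"].fillna(1000)
)
features = features.fillna(features.median(numeric_only=True))
```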
Machine learning algorithm
A random forest classifier, implemented in scikit-learn24, was trained for classification. A random forest can be expected to be relatively robust against overfitting, which is a concern given our small and potentially noisy dataset. Other machine learning algorithms were also tested but did not demonstrate any significant performance improvement. In particular, two gradient-boosted decision tree algorithms (LightGBM25 and CatBoost26) and one deep learning algorithm optimized for tabular data (TabNet27) were applied to the same data after the main analysis was completed, to enable a performance comparison.
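A minimal sketch of this setup with synthetic stand-in data follows; the feature values are simulated, and n_estimators = 1000 follows the setting reported under hyperparameter tuning below.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded metadata, with a class imbalance
# roughly mimicking rare adverse events.
X, y = make_classification(n_samples=2490, n_features=100,
                           weights=[0.93], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=500, random_state=0)

clf = RandomForestClassifier(n_estimators=1000, random_state=0)
clf.fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]  # class-1 probabilities for ROC/PR
```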
Feature selection was conducted using backward feature elimination, an iterative method that begins with all available features and systematically removes the least significant ones. At each step, the features contributing least to the model, as ranked by impurity-based feature importance, are excluded. This process continues until the desired number of features is reached. Before backward feature elimination, a total of 4547 features were available for perforation and bleeding, and 493 features for readmission.
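One way to realize this procedure is scikit-learn's recursive feature elimination, which drops the least important features (by impurity-based importance) in rounds; the sketch below reuses X_train and y_train from the previous example, and the step fraction and target size are illustrative choices rather than values from the study.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Backward elimination via recursive feature elimination (RFE).
selector = RFE(
    estimator=RandomForestClassifier(n_estimators=200, random_state=0),
    n_features_to_select=100,  # desired number of remaining features
    step=0.1,                  # drop the 10% least important features per round
)
selector.fit(X_train, y_train)
X_train_selected = selector.transform(X_train)  # reduced feature matrix
```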
Hyperparameter tuning and class imbalance
Hyperparameter tuning is the process of selecting the optimal combination of model parameters to maximize model performance. This is typically achieved by dividing the training data into smaller subsets, such as a training subset and a validation subset. The training subset is used to fit the model, while the validation subset evaluates its performance for each set of hyperparameter choices. The space of possible hyperparameter combinations is then systematically explored using a grid search to identify the optimal configuration (other search strategies, such as Bayesian optimization, are possible alternatives).
For the adverse events perforation and bleeding, the number of features was treated as a hyperparameter, ranging from 50 to 500 in increments of 50. For the adverse event readmission, the number of features was set to 100. Hyperparameter tuning of the random forest’s internal parameters was tested but had no noticeable impact on performance, either positive or negative. Consequently, all results were obtained with consistent settings: the number of estimators of the random forest (n_estimators) was set to 1000, and all other parameters were left at their scikit-learn defaults. To address class imbalance, balanced class weights, synthetic data augmentation using SMOTE, and Balanced Random Forests were tested, but none yielded a performance improvement.
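A sketch of this search, treating the number of retained features as the grid dimension, is shown below; it reuses X_train and y_train from the earlier sketch, and the pipeline layout and fold count are illustrative assumptions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Feature count searched from 50 to 500 in steps of 50, as described above.
pipe = Pipeline([
    ("select", RFE(RandomForestClassifier(n_estimators=200, random_state=0))),
    ("clf", RandomForestClassifier(n_estimators=1000, random_state=0)),
])
grid = GridSearchCV(
    pipe,
    param_grid={"select__n_features_to_select": list(range(50, 501, 50))},
    scoring="average_precision",  # average precision approximates AUC-PR
    cv=3,                         # illustrative fold count
)
grid.fit(X_train, y_train)
print(grid.best_params_)
```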
Feature importance and stability
The most important features were identified using SHAP19. To evaluate the algorithm’s stability, random subsampling28 was conducted 100 times. In each iteration, the data was randomly split into training and test sets, with the machine learning algorithm trained on the training set and evaluated on the test set. For the adverse event readmission, all 213 cases had manual labels, allowing direct random subsampling on the full dataset. However, random subsampling could not be directly applied to the adverse events perforation and bleeding due to the limited availability of manual labels (500 cases, defined as the ground truth). For these cases, algorithm stability was assessed using labels generated by a large language model for the entire dataset, which are considered an approximation.
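A compact sketch of the subsampling loop and the SHAP step follows, reusing the synthetic X and y from the earlier sketch; iteration count and split sizes follow the text, while the seeds are arbitrary.

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

auc_pr = []
for seed in range(100):
    # Fresh random split into training and test set in every iteration.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=500,
                                              random_state=seed)
    clf = RandomForestClassifier(n_estimators=1000, random_state=seed)
    clf.fit(X_tr, y_tr)
    auc_pr.append(average_precision_score(y_te, clf.predict_proba(X_te)[:, 1]))

# The standard deviation across iterations serves as the stability estimate.
print(f"AUC-PR: {np.mean(auc_pr):.2f} +/- {np.std(auc_pr):.2f}")

# SHAP values for the last fitted model rank the most influential features.
shap_values = shap.TreeExplainer(clf).shap_values(X_te)
```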
Evaluation metrics
As target metrics, we evaluated the area under the curve of the receiver operating characteristic (AUC-ROC) and the area under the precision-recall curve (AUC-PR). In particular, we focused on the precision-recall curve due to the imbalance in our dataset. Given the imbalanced distribution of adverse events and the need to accurately identify adverse events while minimizing false positives, we regard AUC-PR as the most relevant performance metric29. In contrast, the AUC-ROC may provide overly optimistic estimates when applied to highly imbalanced datasets17.
The AUC-PR is assessed relative to its baseline, which is defined by the performance of a dummy classifier (i.e., one that makes random predictions). For such a dummy classifier, the AUC-PR corresponds to the occurrence rate of adverse events in the test dataset. For instance, if 20 adverse events occur in 100 cases, the AUC-PR for a dummy classifier would be calculated as 20 divided by 100, resulting in a value of 0.2 (note that the AUC-ROC of a dummy classifier is always 0.5, regardless of the rate at which adverse events occur).
This baseline serves as a meaningful reference point to assess the model’s effectiveness in detecting rare adverse events. The error (defined as the standard deviation) of the area under the curve metrics serves as an estimate of the stability of the machine learning algorithm.
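The following minimal check reproduces the worked example above, using scikit-learn's average precision as the usual estimator of AUC-PR.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = np.array([1] * 20 + [0] * 80)  # 20 adverse events among 100 cases
y_dummy = rng.random(100)               # random scores from a dummy classifier

print(roc_auc_score(y_true, y_dummy))            # close to 0.5
print(average_precision_score(y_true, y_dummy))  # close to 0.2, the prevalence
```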
Data availability
The data are not publicly available because they consist of electronic health records collected at the University Hospital Mannheim. Publicly sharing these data would violate the terms of the original ethical approval and could compromise patient privacy. De-identified patient data or other prespecified data will be made available upon approval of a written request and the signing of a data sharing agreement.
Code availability
The underlying code of this project is available on GitHub (https://github.com/stewitt/MLEndoAE). The analysis was conducted using Python (version 3.10.11) along with the following libraries: scikit-learn (version 1.0.2), NumPy (version 1.24.2), pandas (version 2.0.0), Matplotlib (version 3.7.1), LightGBM (version 4.6.0), CatBoost (version 1.2.7), PyTorch (version 2.0.1), and SHAP (version 0.46.0).
References
Kavic, S. M. & Basson, M. D. Complications of endoscopy. Am. J. Surg. 181, 319–332 (2001).
Mergener, K. Defining and measuring endoscopic complications: more questions than answers. Gastrointest. Endosc. Clin. N. Am. 17, 1–9 (2007).
Adler, A. et al. Data quality of the German screening colonoscopy registry. Endoscopy 45, 813–818 (2013).
Esteva, A. et al. Deep learning-enabled medical computer vision. NPJ Digit Med. 4, 5 (2021).
Harerimana, G., Kim, J. W., Yoo, H. & Jang, B. Deep learning for electronic health records analytics. IEEE Access 7, 101245–101259 (2019).
Ali, S. Where do we stand in AI for endoscopic image analysis? Deciphering gaps and future directions. NPJ Digit Med. 5, 184 (2022).
Tang, A. S. et al. Leveraging electronic health records and knowledge networks for Alzheimer’s disease prediction and sex-specific biological insights. Nat. Aging 4, 379–395 (2024).
Ravaut, M. et al. Predicting adverse outcomes due to diabetes complications with machine learning using administrative health data. NPJ Digit Med. 4, 24 (2021).
Zhang, X. S., Tang, F., Dodge, H. H., Zhou, J. & Wang, F. MetaPred: meta-learning for clinical risk prediction with limited patient electronic health records. In Proc. 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining 2487–2495 (Association for Computing Machinery, 2019).
Hoffman, H. et al. Machine learning for clinical outcome prediction in cerebrovascular and endovascular neurosurgery: systematic review and meta-analysis. J. Neurointerv. Surg. https://doi.org/10.1136/jnis-2024-021759 (2024).
Kavakiotis, I. et al. Machine learning and data mining methods in diabetes research. Comput. Struct. Biotechnol. J. 15, 104–116 (2017).
Adamson, B. et al. Approach to machine learning for extraction of real-world data variables from electronic health records. Front. Pharmacol. 14, 1180962 (2023).
Wiest, I. C. et al. Privacy-preserving large language models for structured medical information retrieval. NPJ Digit Med. 7, 257 (2024).
Huang, J. et al. A critical assessment of using ChatGPT for extracting structured data from clinical notes. NPJ Digit Med. 7, 106 (2024).
Wiest, I. C. et al. Deep sight: enhancing periprocedural adverse event recording in endoscopy by structuring text documentation with privacy-preserving large language models. iGIE 3, 447–452.e445 (2024).
Rajkomar, A. et al. Scalable and accurate deep learning with electronic health records. NPJ Digit Med. 1, 18 (2018).
Saito, T. & Rehmsmeier, M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE 10, e0118432 (2015).
Efron, B. The Jackknife, the Bootstrap and Other Resampling Plans. CBMS-NSF Regional Conference Series in Applied Mathematics (SIAM, 1982).
Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. In Proc. 31st International Conference on Neural Information Processing Systems 4768–4777 (Curran Associates Inc., 2017).
Rädsch, T. et al. What your radiologist might be missing: using machine learning to identify mislabeled instances of X-ray images. In Proc. 54th Hawaii International Conference on System Sciences 1294 (2021).
Zhao, J., Henriksson, A., Asker, L. & Boström, H. Predictive modeling of structured electronic health records for adverse drug event detection. BMC Med. Inf. Decis. Mak. 15, S1 (2015).
Ribeiro, M. T., Singh, S. & Guestrin, C. Why should I trust you?: explaining the predictions of any classifier. In Proc. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 1135–1144 (Association for Computing Machinery, 2016).
McKinney, W. Data structures for statistical computing in Python. In Proc. 9th Python in Science Conference (eds van der Walt, S. & Millman, J.) 56–61 (2010).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Ke, G. et al. LightGBM: A highly efficient gradient boosting decision tree. In Proc. 31st International Conference on Neural Information Processing Systems 3149–3157 (Curran Associates Inc., 2017).
Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V. & Gulin, A. CatBoost: unbiased boosting with categorical features. In Proc. 32nd International Conference on Neural Information Processing Systems 6639–6649 (Curran Associates Inc., 2018).
Arik, S. Ö. & Pfister, T. Tabnet: attentive interpretable tabular learning. In Proc. AAAI Conference on Artificial Intelligence Vol. 35 6679–6687 (2021).
Akritas, M. G. & Politis, D. N. Recent Advances and Trends in Nonparametric Statistics (JAI Press, 2003).
Sofaer, H. R., Hoeting, J. A., Jarnevich, C. S. & McPherson, J. The area under the precision-recall curve as a performance metric for rare binary events. Methods Ecol. Evol. 10, 565–577 (2019).
Acknowledgements
We acknowledge financial support from Heidelberg University for the publication fee.
Funding
Open Access funding enabled and organized by Projekt DEAL.
Author information
Contributions
S.W., I.W., and S.B. were involved in conceptualization, study design, and methodology. M.J. and F.S. extracted the data. S.W. and S.B. were involved in data processing/analysis and analyzed the results of the random forest classification. I.W. executed the Large Language Model analysis. S.W. performed data processing, conceptualized, trained, and tested the random forest machine learning algorithm, and wrote the initial manuscript. S.B., F.S., M.E., and J.K. provided supervision through all stages of the study. All authors approved the final version of the manuscript for submission.
Ethics declarations
Competing interests
S.B. declares consulting services for Olympus. I.W. received honoraria from AstraZeneca. J.K. declares consulting services for Bioptimus, France; Panakeia, UK; AstraZeneca, UK; and MultiplexDx, Slovakia. Furthermore, he holds shares in StratifAI, Germany, Synagen, Germany, and Ignition Lab, Germany; has received an institutional research grant from GSK; and has received honoraria from AstraZeneca, Bayer, Daiichi Sankyo, Eisai, Janssen, Merck, MSD, BMS, Roche, Pfizer, and Fresenius. All other authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Wittlinger, S., Wiest, I.C., Ladani, M.J. et al. How machine learning on real world clinical data improves adverse event recording for endoscopy. npj Digit. Med. 8, 424 (2025). https://doi.org/10.1038/s41746-025-01826-5