Introduction

The liver is one of the most frequently injured organs. Liver trauma accounts for about 15% of all abdominal injuries1,2 and approximately 5% of all emergency admissions3. Liver trauma is often accompanied by further complications, such as secondary bleeding, bile duct injury, bile leakage, and peritonitis, leading to high mortality that can exceed 75%4,5. The prognosis of patients can be notably improved by quick evaluation of the injury and minimally invasive treatment6,7. Thus, timely assessment of the liver is the priority for trauma patients, especially under emergency conditions.

Ultrasound (US) is a commonly used method to assess liver trauma. It is an affordable real-time imaging approach that allows dynamic, convenient, and continuous examination2,8. However, compared to other imaging techniques, the quality of US imaging is significantly affected by the physiological characteristics of patients and is more dependent on operator technique and human factors9. Artificial intelligence (AI)-empowered ultrasonography therefore has the potential to further accelerate the use of medical ultrasound in various clinical settings by analyzing the patient's multimodal imaging information, simplifying operation steps, avoiding subjective differences, saving physician resources, shortening reporting time, and improving diagnostic efficiency10,11,12,13.

At present, AI studies mainly focus on thyroid nodules, breast nodules, and liver tumors. Because of the smaller clinical sample sizes compared to superficial organs such as the thyroid and breast, the complexity of images containing numerous tissues and organs, and the difficulty of computer learning, there are relatively few reports on AI-assisted diagnosis in obstetrics, gynecology, pelvic floor, and vascular diseases. In the context of liver injury, AI can potentially detect subtle texture differences in images that cannot be recognized by the naked eye14, which may help in the early diagnosis of liver parenchymal contusion and laceration missed by doctors in US images. Aside from liver tumors, successful use of AI has been reported for non-alcoholic fatty liver disease15, cholangitis16, and focal liver lesions17. Therefore, we established and validated a new convolutional neural network-based deep learning model (CNDLM) for identifying liver trauma (parenchymal contusion and laceration). We hypothesized that the AI-assisted US model could achieve diagnostic efficiency comparable to sonographers and could assist in the early identification of liver trauma.

Results

Establishment of the CNDLM

A total of 1603 US images (1373 in the training set, 230 in the validation set; 116 were selected as the test set) were obtained from pigs, and 126 were obtained from patients admitted to 3 centers (30 from the first center were selected as the training set for fine-tuning of the CNDLM, 18 as a clinical internal validation set, and 78 as a clinical external validation set). After training and fine-tuning, there were 3 animal CV models and 3 clinical CV models, with acceptable accuracy in the animal test set (Table S1 and Fig. 1). The CNDLM was then established.

Fig. 1

Study flow chart. The three steps were data preparation, model construction, and model evaluation. Data preparation included collecting animal and clinical US images. Model construction was the training process: three-fold cross-validation was used to train the animal CV models, which were then transferred to clinical data, and the 3 CV models were finally ensembled. Model evaluation was performed as a comparison with trained sonographers using visual methods and quantitative metrics.

Performance of CNDLM

In the animal test set, the CNDLM achieved significantly better performance (DSC = 0.91 [0.87–0.95]; TPR = 0.91 [0.86–0.95]; PPV = 0.95 [0.93–0.97]; HD-95 = 2.55 [1.85–3.56]) than the Junior and Senior sonographers (all P < 0.001) (Table 1; Fig. 2). The CNDLM also demonstrated satisfactory performance (DSC = 0.88 [0.84–0.93]; TPR = 0.87 [0.81–0.92]; PPV = 0.92 [0.86–0.96]; HD-95 = 11.78 [4.08–22.04]) in the clinical internal test set (Table 1; Fig. 3), showing superior diagnostic performance compared to the Junior (P = 0.001 and 0.012, respectively) and Senior sonographers (both P = 0.004). In the external test set, the CNDLM achieved a DSC of 0.74 (0.66–0.81), TPR of 0.80 (0.73–0.87), PPV of 0.74 (0.65–0.82), and HD-95 of 14.84 (8.34–23.11). This performance was significantly better than that of one junior sonographer but significantly inferior to that of the other junior sonographer (Table 1; Fig. 4).

Table 1 The detailed quantitative metrics of the ensembled model and different level sonographers.
Fig. 2

The segmentation results of the ensembled model and different level sonographers in the animal test set. The red region is a mask, the green region is a model prediction, and the yellow region is an overlap region of mask and model predictions.

Fig. 3

The segmentation results of the ensembled model and different level sonographers in the clinical internal test. The red region is a mask, the green region is a model prediction, and the yellow region is an overlap region of mask and model predictions.

Fig. 4

The segmentation results of the ensembled models and different level sonographers in the clinical external test set. The red region is a mask, the green region is a model prediction, and the yellow region is an overlap region of mask and model predictions.

Discussion

In this study, a deep learning model was proposed for liver injury detection, which achieved satisfactory diagnostic performance on clinical US images, comparable to that of trained sonographers. To the best of our knowledge, this is the first study to propose a clinical DL model for liver trauma evaluation with demonstrated good accuracy. Applying AI methods to liver US could potentially increase imaging quality while limiting the influence of human factors, enabling an early, rapid, and accurate diagnosis of liver trauma.

The Bama miniature pig's liver resembles the human liver in terms of anatomy, morphology, and texture echo18; thus, these animals were selected for this study. The automatic segmentation results of the model were compared with the standard range manually outlined by trained sonographers, and the DSC was used to evaluate the model's output performance. It is generally believed that a Dice coefficient > 0.7 indicates good model performance19,20. In this study, the DSCs were all above 0.70, confirming that the automatically segmented liver injury area overlapped well with the real injury area. The average positive predictive value (PPV) reached 0.884, and the average true positive rate (TPR) was about 0.780, which is comparable to previous reports by Sato et al.21 and Saillard et al.22 on DL models for the diagnosis of hepatocellular carcinoma. These results confirm that the model not only accurately located the wound but also reliably outlined its contour.
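As an illustration, the overlap metrics discussed above can be computed directly from two binary masks. The following is a minimal NumPy sketch, not the study's actual code; the function name and the toy masks are purely illustrative:

```python
import numpy as np

def segmentation_metrics(pred, mask):
    """Compute DSC, TPR, and PPV from two binary segmentation masks.
    `mask` is the sonographer-drawn gold standard, `pred` the model output."""
    pred = pred.astype(bool)
    mask = mask.astype(bool)
    tp = np.logical_and(pred, mask).sum()    # overlapping pixels
    fp = np.logical_and(pred, ~mask).sum()   # over-segmented pixels
    fn = np.logical_and(~pred, mask).sum()   # under-segmented pixels
    dsc = 2 * tp / (2 * tp + fp + fn)        # Dice similarity coefficient
    tpr = tp / (tp + fn)                     # sensitivity
    ppv = tp / (tp + fp)                     # precision
    return dsc, tpr, ppv

# toy 4x4 masks whose 4-pixel positive regions overlap on 2 pixels
mask = np.zeros((4, 4), dtype=bool); mask[1:3, 1:3] = True
pred = np.zeros((4, 4), dtype=bool); pred[2:4, 1:3] = True
dsc, tpr, ppv = segmentation_metrics(pred, mask)
```

With the toy masks above, tp = fp = fn = 2, so all three metrics equal 0.5.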

Despite the promising accuracy, peripheral false positives or internal false negatives occurred when lesions were over- or under-segmented. This may be at least partly explained by two points. Firstly, the grayscale of the ultrasound image of liver trauma changes with time. Within a week of the formation of liver trauma, the trauma area often shows a heterogeneous or slightly hyperechoic area, and sometimes the grayscale contrast with normal liver parenchyma is not significant. With time, the echo in the lesion area gradually decreases, and the grayscale contrast with normal tissue gradually increases; most lesions regain echogenicity similar to the surrounding liver parenchyma after about two months23. Secondly, in the case of slight injury, the grayscale contrast with normal tissue is not significant, while the more severe the injury, the greater the grayscale difference from normal tissue. Therefore, the sensitivity and specificity of the proposed model should be further evaluated on lesions of different ages and severities.

One of the strengths of this study is the large number of animal US images collected for training, which were used to guide the model and preserve low-level features through pre-trained weights. Additionally, the proposed AI model was fine-tuned using ultrasound image data of patients to correct for differences between humans and animals. This approach could potentially raise the diagnostic accuracy to the level of human sonographers, as previously reported24. To test the potential of the model, this study compared the diagnostic accuracy of AI with that of physicians of different seniority. The results showed that the model outperformed one of the junior physicians in the external test set, and the ensembled model had better diagnostic performance than physicians of all levels in the internal test set, which is consistent with previous studies24,25. Thus, in small hospitals with comparable conditions, as well as teaching hospitals that train young residents, AI-assisted diagnosis could potentially help improve the quality of ultrasound diagnosis.

Finally, the concept of "golden time" is emphasized in trauma first aid, with time being the most important factor affecting survival26. Therefore, methods that shorten the assessment time before surgery are most important. In emergency situations, US can effectively guide clinical classification and treatment, but its interpretation is significantly affected by the physiological characteristics of patients27. Moreover, ultrasound examination involves certain subjective differences, and training professionals requires long-term practice and learning. In contrast, combining US with AI can simplify operation steps, avoid subjective differences, save physician resources, shorten examination time, and improve diagnostic efficiency. This study demonstrated that applying AI to the diagnosis of liver trauma could potentially maximize the portability and speed of real-time US interpretation, consistent with previous studies28,29,30.

This study has several limitations. Firstly, the number of clinical US images was limited due to the low incidence of liver trauma, resulting in an imbalance of image sources. To address this issue, clinical data were used to fine-tune the animal model, and data augmentation and batch normalization were applied to reduce overfitting and enhance robustness. Secondly, the clinical data in this study were collected retrospectively, which could have led to potential bias. The transfer learning method was adopted because the available clinical US data alone could not meet the needs of building deep learning models. Although we used methods such as enlarging the dataset, augmenting the data, reducing the learning rate, and applying regularization during preprocessing, a certain degree of overfitting remained in model training, which should be addressed with larger datasets in future research. Finally, clinical data of patients (including weight and blood pressure) were not taken into account at this stage. Future studies should focus on validating the model in the clinical setting.

In conclusion, this study proposed a deep learning model based on the RefineNet structure, whose encoder uses ResNet101 with pre-trained weights from ImageNet. RefineNet is a multi-path refinement network that uses residual connections to make maximal use of all information from the downsampling process when making high-resolution predictions. This structure can therefore capture high-level deep features related to injury in the image. The resulting model is helpful for liver injury detection in 2D ultrasound images and achieved diagnostic performance comparable to that of junior sonographers.

Methods

The study was undertaken in two stages. The first stage used the animal experimental US images as the training and validation sets. The second stage used the clinical US images as the test set. The animal study strictly complied with the National Institute of Animal Health regulations in experimental design, animal feeding, and sample acquisition. The whole study was approved by the Ethics Committee of the PLA General Hospital (approval number: S2020-323-01). All methods were carried out in accordance with relevant guidelines and regulations.

Establishment of the experimental model

The animal experimental stage took place between January 2021 and April 2021. Eighteen Bama miniature pigs (Animal Experimentation Center, PLA General Hospital) weighing 30–35 kg underwent intramuscular anesthesia (0.3 ml/kg of a 1:1 mixture of midazolam and serrazine hydrochloride, injected intramuscularly) and endotracheal intubation. The animals were immobilized, venous access was established through the marginal ear vein, and liver parenchymal contusion and laceration were modeled with a self-made percussion device (Supplementary Fig. S1). After modeling, contrast-enhanced ultrasound (ultrasound diagnostic instrument with a 5–15 MHz convex probe, 3–5 MHz frequency) was performed to obtain multiple gray-scale (GS) ultrasound images of the trauma area from different angles. The liver trauma region was defined as the region without enhancement in US images, evaluated by a junior sonographer (with less than 3 years of experience). The experimental animals were euthanized by deep anesthesia with intramuscular injection of Soothe 50 (10 mg/kg) and xylazine hydrochloride (8 mg/kg), followed by exsanguination.

Clinical data collection

Clinical US images were collected from patients with liver trauma treated at the first medical center of PLA General Hospital, the seventh medical center of PLA General Hospital, and the Beijing Chaoyang emergency center between January 2015 and April 2021; multiple centers were included to avoid data bias.

The inclusion criteria were: (1) trauma history within one week; (2) liver trauma confirmed by surgery or contrast-enhanced ultrasound; (3) complete clinical and imaging information. The exclusion criteria were: (1) poor image quality; (2) chronic liver lesions (such as liver cirrhosis); (3) non-traumatic liver rupture (such as rupture of a liver tumor). The gold standard for the liver trauma region was defined as the region of no enhancement in contrast-enhanced US imaging, drawn by the same sonographer. A true positive was defined as the overlap between the segmented liver trauma region and the gold-standard region.

Combining the data from all hospitals and splitting them into 3 datasets helped avoid bias from single hospitals. Thirty clinical US images from the data collected at the first and seventh medical centers of PLA General Hospital were randomly selected as the fine-tuning set, and the remaining images from the two centers were selected as the clinical internal test set. The clinical US images collected from the Beijing Chaoyang emergency center were used as the clinical external test set.

Establishment and evaluation of the deep learning (DL) model

In this study, the animal experimental data were divided into training, validation, and testing datasets, with a 6:1 ratio between the training and validation datasets. The testing dataset was independent of the training and validation datasets. The training dataset was used to train the model, and the validation dataset to validate the trained model; after training, the model was evaluated on the test dataset. The detailed quantitative metrics on the test set are shown in Table S1.
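The 6:1 training/validation split described above can be sketched as follows. This is an illustrative Python snippet; the ID range, seed, and function name are hypothetical, not taken from the study:

```python
import random

def split_train_val(image_ids, ratio=6, seed=0):
    """Shuffle image IDs and split them into training and validation
    subsets at the given training:validation ratio (here 6:1).
    The independent test set is assumed to be held out beforehand."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)         # reproducible shuffle
    n_val = len(ids) // (ratio + 1)          # one part of (ratio + 1) for validation
    return ids[n_val:], ids[:n_val]          # (training, validation)

# hypothetical pool of 1400 non-test animal image IDs
train_ids, val_ids = split_train_val(range(1400))
```

With 1400 IDs this yields 1200 training and 200 validation images, and the two subsets are disjoint.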

The clinical models were obtained by fine-tuning the animal models on the clinical fine-tuning dataset. We collected 48 patients from the First and Seventh Medical Centers of the PLA General Hospital and 78 patients from the Beijing Chaoyang Emergency Center. Collecting data from different hospitals and learning the differential features between animal and clinical data can help eliminate operational factors.

We constructed three-fold cross-validation (CV) models using the animal dataset and trained them using pre-trained weights from ImageNet. Three-fold cross-validation divides a sample set into three parts, using one part as the validation set and the remaining data as the training set; it is commonly used to evaluate machine learning model performance and to tune hyperparameters. The three models differ only in their training and validation datasets, so three models were trained in each experiment. Then, for each of the three animal CV models, all parameters except those of the last two convolutional blocks were frozen, and the models were fine-tuned using clinical training data to generate three clinical models. These three models were ensembled into the final trauma-assisted diagnosis model (CNDLM). The performance of the CNDLM was compared with that of sonographers on the animal test set, the clinical internal validation set, and the clinical external validation set.
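The text does not specify how the three clinical CV models were combined; one common ensembling strategy, assumed here purely for illustration, is to average the per-pixel foreground probabilities of the three models and threshold the mean:

```python
import numpy as np

def ensemble_segmentation(prob_maps, threshold=0.5):
    """Average the per-pixel foreground probabilities predicted by the
    CV models and threshold the mean to obtain the final binary mask."""
    mean_prob = np.mean(np.stack(prob_maps), axis=0)
    return mean_prob >= threshold

# hypothetical 2x2 probability maps from the three clinical CV models
p1 = np.array([[0.9, 0.2], [0.6, 0.1]])
p2 = np.array([[0.8, 0.4], [0.5, 0.2]])
p3 = np.array([[0.7, 0.3], [0.7, 0.0]])
final_mask = ensemble_segmentation([p1, p2, p3])
```

In this toy case the mean probabilities are 0.8, 0.3, 0.6, and 0.1, so only the left column of pixels is labelled as trauma. Majority voting over the three binary outputs would be an equally plausible alternative.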

The workflow of the segmentation DL model is shown in Fig. 1. The segmentation masks were extracted with the Visual Object Tagging Tool (VOTT, Microsoft, USA) and the OpenCV package in Python. The images and segmentation masks were then resized to 224 × 224 pixels. The model was developed with RefineNet (Supplementary Fig. S2), in which a ResNet101 network pre-trained on the ImageNet dataset was employed (Supplementary Fig. S3 and Table S2). The final model was evaluated on internal and external test cohorts of human US images. In the training process, the parameters of the model were iteratively updated through backpropagation, with the image as input and the segmentation mask as target output. The Dice loss between the output and the segmentation mask was used as the loss function. A data augmentation strategy, including random flipping and rotation of the input image, was employed to reduce overfitting. A detailed description of the model establishment and evaluation is provided in the Supplementary Methods section.
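The Dice loss used as the training objective can be sketched as follows. This is a minimal NumPy version for clarity only; in actual training a differentiable framework implementation would be used, and the smoothing term `eps` is a common convention, not a detail stated in the paper:

```python
import numpy as np

def dice_loss(pred_prob, target, eps=1e-6):
    """Soft Dice loss: 1 - DSC between the predicted foreground
    probabilities and the binary mask (eps avoids division by zero)."""
    intersection = np.sum(pred_prob * target)
    denom = np.sum(pred_prob) + np.sum(target)
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)

target = np.array([1.0, 0.0, 1.0])
loss_perfect = dice_loss(target, target)      # perfect overlap, loss near 0
loss_empty = dice_loss(np.zeros(3), target)   # no overlap, loss near 1
```

Minimizing this loss drives the predicted probability map toward maximal overlap with the mask, which matches the DSC used for evaluation.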

To compare the diagnostic efficiency with the final model, the animal and clinical images in the validation sets were evaluated by two junior sonographers (with less than 3 years of experience) and two senior sonographers (with more than 10 years of experience).

Statistical analysis

R software (version 3.3.1; R Foundation for Statistical Computing, Vienna, Austria) was used for statistical analysis. The Dice similarity coefficient (DSC), true positive rate (TPR), positive predictive value (PPV), and 95% Hausdorff distance (HD-95) were used to evaluate the performance of the CV models, the CNDLM, and the sonographers. The DSCs of the models and the sonographers were compared using the Kruskal–Wallis test. All statistical tests were two-sided, and the significance level was set at P < 0.05.
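For reference, HD-95 between two contours given as point sets can be computed as the larger of the two directed 95th-percentile nearest-neighbour distances. The following NumPy sketch uses this common definition; it is illustrative and not the study's implementation:

```python
import numpy as np

def hd95(points_a, points_b):
    """95% Hausdorff distance between two contours given as (N, 2)
    arrays of pixel coordinates: the larger of the two directed
    95th-percentile nearest-neighbour distances."""
    a = np.asarray(points_a, dtype=float)
    b = np.asarray(points_b, dtype=float)
    # pairwise Euclidean distances between every point of A and of B
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    d_ab = d.min(axis=1)   # each point of A to its nearest point of B
    d_ba = d.min(axis=0)   # each point of B to its nearest point of A
    return max(np.percentile(d_ab, 95), np.percentile(d_ba, 95))

identical = hd95([[0, 0], [1, 0]], [[0, 0], [1, 0]])  # identical contours
shifted = hd95([[0, 0]], [[3, 4]])                    # single points 5 apart
```

Unlike the full Hausdorff distance (the maximum nearest-neighbour distance), the 95th percentile discounts the few worst outlier points, which is why it is preferred for segmentation boundaries.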