Introduction

Chronic wounds pose a significant healthcare burden worldwide, both in terms of patient morbidity and healthcare costs. Older individuals face a heightened risk of developing chronic wounds due to the natural slowdown of wound healing processes that comes with ageing, a risk compounded by the higher prevalence of cardiovascular diseases and diabetes in older age groups1. The economic cost of wound care represents up to 4% of the total health budget in developed countries1,2. With the ageing population worldwide, the demand for products and technologies to support better care will increase. The financial burden includes direct costs for wound care supplies, treatments, and healthcare professionals’ interventions, alongside indirect costs arising from productivity loss, disability, and reduced quality of life for patients and caregivers. The major types of chronic wounds are pressure, venous, arterial and diabetic ulcers.

The wound healing process is complex and involves several phases. Proper monitoring, assessment and documentation of the wound’s evolution over time are essential to allow the adjustment of the applied treatment according to the wound’s progression3. However, wound assessment and measurement rely on visual examination, which can be highly subjective and varies with the clinician’s experience.

Wound evolution is assessed not only based on its dimensions, but also considering the proportion of the tissue types present in the wound bed region. The Red/Yellow/Black model allows the discrimination of tissues according to the phase of the healing process they are in, providing important information for evaluating wound healing. It associates red with granulation tissue, yellow with slough (devitalised tissue not ready to heal), and black with necrotic (eschar) tissue4.

This work addresses the challenge of effective chronic wound bed characterisation by proposing a fully automated framework based on state-of-the-art deep learning architectures. The primary contributions of this study are summarised as follows:

  • A novel, private dataset for detailed differentiation of granulation, slough, and eschar tissues in chronic wounds;

  • Quantitative inter-rater agreement analysis, highlighting the subjectivity inherent to tissue assessment and establishing a human performance baseline to benchmark automated approaches;

  • Detailed investigation of knowledge transfer, using both convolutional neural networks (CNNs) and transformer-based models, from the simpler task of open wound segmentation to the more complex problem of tissue segmentation via fine-tuning;

  • Development of a reliable and robust fully automated pipeline for potential deployment in clinical practice, improving wound care through reduced subjectivity and increased reproducibility of healing assessments.

These findings advance the field of automated chronic wound assessment, providing a pathway toward more consistent, objective and clinically applicable decision-support tools.

Related work

Recently, several research works have focused on automating different tasks of the wound assessment process, to alleviate the burden on healthcare professionals and make wound status evaluation more reproducible3,5. Many research lines pertain to the extraction of relevant properties from wound images, including the identification of the wound’s aetiology and location6, the outline and measurement of the wound region7, the recognition of healing complications8 and periwound skin alterations9, and the characterisation of the tissues inside the wound bed region.

Wound tissue differentiation is a challenging task in clinical practice, with its inherent subjectivity affecting the reproducibility of the wound bed composition estimated by clinical experts. Considering intra- and inter-rater agreement studies, an analysis was performed in10 using 58 wound images from the Swift private dataset, where four tissues were annotated independently by five clinicians with a one-week interval. Although intra-observer agreement reached moderate to high levels, the inter-observer agreement recorded was lower, being considered weak for epithelial (0.389) and devitalised tissue (0.591) and moderate for necrotic (0.759) and granulation tissue (0.765). It was also verified that the clinicians’ visual estimation overestimated epithelial and necrotic tissue and underestimated devitalised and granulation tissue compared to the proportions calculated through their notes. Moreover, the error distribution between the visual estimate and the calculated proportion had high variability for all tissue types, with standard deviations of 38% and 39%.

Motivated by the variability observed for this task, many works use machine learning (ML) and computer vision algorithms to automate it and increase its reliability. Some methodologies simplify wound bed characterisation into a classification problem, recognising the presence or absence of specific tissue types in the images11. For tissue segmentation, the simplest ML approaches focus on colour recognition. In12, the different regions are segmented through direct recognition of the pixels of each colour. In contrast, the methodology followed in13 considers a white reference marker and separates the regions through a thresholding process based on the HSV (hue, saturation, value) colour space. An overall accuracy of 75% and accuracies of 76.2%, 63.3% and 75.1% for granulation, slough and necrotic tissues, respectively, were reported, comparable with the agreement between the different experts who annotated the ground truth (0.65-0.85)13. In14, the authors use a Convolutional Neural Network (CNN) to directly distinguish tissue types in images of pressure ulcers. Many algorithms developed for tissue differentiation first determine the open wound region. In15, the wound is divided into the largest number of regions with a homogeneous tissue type using the k-means algorithm. Then, the tissue in each region is identified using a Support Vector Machine (SVM), based on colorimetric, topological and morphological properties. The work in16 presents an end-to-end system that uses superpixels (Spx) generated with Simple Linear Iterative Clustering (SLIC) as input to feed different neural network models (U-Net, SegNet and FCN-Net) for automatic segmentation of diabetic foot ulcers (DFU) and tissue differentiation. The best method, Spx-FCN32, outperforms classical Fully Convolutional Network (FCN) models, significantly improving performance in all metrics (accuracy 92.68% and Dice 75.74%). In17, popular decoder models are compared for tissue segmentation. A total of 2836 images of pressure ulcers, annotated using SLIC superpixel pre-processing, were used, obtaining an accuracy of 99.57% with the DeepLabV3 architecture.

The individual analysis of pixels in the wound region, without prior grouping into uniform tissue sub-regions, is also well represented in the literature. Most works use ML models to identify the tissue in each pixel. In18, colour and texture features are extracted from the open wound region and fed to Bayesian models and SVMs to determine the type of tissue corresponding to each pixel. Some studies also sample the open wound area by dividing the region of interest into fixed-size patches19,20. In19, a CNN model obtained an average Dice of 91.38%, while20 used a multidimensional CNN model, reporting an accuracy of 99.55%. The works of García-Zapirain et al.14,21 evaluate the impact of prior segmentation of the wound region on the results achieved for tissue differentiation. The performance achieved using a single CNN to directly recognise the tissue type of each pixel in the entire image is similar to that achieved using separate networks to segment the wound region and the tissues within it, while allowing for faster analysis of the wound, which encourages the use of a single neural network for both tasks14. Although the HSI colour space is less sensitive to lighting and provides good contrast between the three main types of tissues, the RGB representation provides essential information for their correct discrimination. The YCrCb space is also presented as an alternative with relevant information for this task, with the related YDbDr space identified in other works22 as one of the most promising representations. In23, an approach for segmenting the wound boundary and classifying the type of wound tissue using GANs is proposed, namely a conditional GAN (c-GAN), with a chronic wound dataset from eKare Inc. The authors evaluated the impact of the number of images used for training and the number of epochs considered, obtaining a Dice score of 90% for the best combination. Finally, in10, the AutoTissue model was trained with 17,000 anonymised image-annotation pairs from the Swift Dataset and tested with 383 images of category 2 pressure ulcers and arterial, venous and diabetic ulcers, returning an average intersection over union (IoU) of 71.92%.

Despite the emergence of several approaches for tissue differentiation, it remains an underdeveloped task, mostly due to the lack of annotated public datasets. Most works focus on open wound segmentation, driven by the Foot Ulcer Segmentation (FUSeg)24 and the Diabetic Foot Ulcer (DFUC)25 challenges, as well as the availability of the Medetec26, AZH FU27 and Wseg28 datasets. For this task, besides methodologies similar to the ones applied for wound tissue segmentation, more complex approaches leveraging state-of-the-art models, such as vision transformers29,30, and the combination of different datasets31 have emerged, showing potential to further improve the performances attained.

Methods

This study proposes an automated wound bed characterisation pipeline for tissue segmentation and relative tissue proportion estimation. To establish a benchmark for automated methods, human performance on the described tasks is evaluated through agreement studies, described in Section "Open wound and tissue segmentation agreement studies", which assess the consistency of manual annotations among different raters and their alignment with the obtained consensus ground truth. Following this, the proposed pipeline, illustrated in Fig. 1, is described. The methodology, further explained in Section "Tissue segmentation", begins by pre-processing the input images through the selection of the wound region of interest. To obtain this region, the wound and marker bounding boxes are used, which may derive from ground truth (GT) annotations or from a detection model introduced in Section "Tissue segmentation". The processed wound images are then provided to the tissue segmentation models to extract the delineation of each tissue across the image. As an optional post-processing step, the open wound segmentation masks, coming either from the GT annotations or from a previously developed open wound segmentation model32, may be used to refine the model outputs and restrict the tissue regions to the wound bed. Finally, the percentage of each tissue is calculated.

Fig. 1

Pipeline for automatic tissue segmentation in chronic wounds.
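
To make the data flow of Fig. 1 concrete, a minimal sketch of the pipeline stages is given below. The function names are illustrative and not taken from the released code; the model objects are passed in as callables, and the crop_wound_roi and refine_and_quantify helpers are sketched in the pre- and post-processing subsections further on.

```python
import numpy as np

def characterise_wound_bed(image: np.ndarray, detector, tissue_model, wound_model):
    """Illustrative composition of the Fig. 1 pipeline (hypothetical names)."""
    # 1. Detection: wound and reference-marker bounding boxes (or GT annotations).
    wound_box, marker_box = detector(image)
    # 2. Pre-processing: padded square crop around the wound, resized and normalised.
    roi = crop_wound_roi(image, wound_box, marker_box)
    # 3. Tissue segmentation: per-pixel labels for granulation, slough and eschar.
    tissue_pred = tissue_model(roi)
    # 4. Optional post-processing: restrict predictions to the open wound mask
    #    and compute the relative tissue percentages.
    wound_mask = wound_model(roi)
    return refine_and_quantify(tissue_pred, wound_mask)
```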

Dataset

In this work, a private Wounds dataset comprising images from several healthcare units, captured using different mobile devices (smartphones), was considered. For data acquisition, a study protocol was submitted to and approved by the different Health Ethical Committees, and informed consent was obtained from both the healthcare professionals and the patients involved. All experiments were performed in accordance with relevant guidelines and regulations. The images were acquired by healthcare professionals who were instructed to centre the wound in the image, to include a 2 \(\times\) 2 centimetre (cm) reference marker, consisting of coloured patches (blue, green, yellow and white), in the same plane as the wound, and to capture at least 4 cm of perilesional skin. A custom-developed mobile application was used to facilitate image acquisition and ensure consistency33,34. The coloured patches were included for potential future research into colour correction methods, aimed at improving the robustness of the models to variations in lighting and imaging conditions. Most of the images were acquired using mid-range devices, both Android and iOS, though some came from low-end or high-end devices. Moreover, the professionals were advised to use good natural light conditions and evenly distributed lighting when possible, to avoid using the camera flash to prevent reflections on wound tissues, and to take images parallel to the wound bed, aligned with the patients’ head-to-toe orientation. Concerning the mask annotations of the wound and its three tissue components (granulation, slough and eschar), 121 out of 307 images were manually annotated by three wound specialists (nurses with different years of experience), who first drew the boundaries separately, whereas the remaining images were annotated by only one specialist. To establish a robust ground truth mask for the wound and each tissue, the regions where the specialists’ annotations intersected, corresponding to the majority agreement, were identified. In cases where consensus was uncertain, the specialists collaboratively refined the mask, ensuring the reliability of the annotated masks. The annotation process was conducted using a custom-built labelling tool developed in-house, as illustrated in Fig. 2. Fig. 3 shows examples of images from the Wounds dataset and corresponding annotations.
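
As a minimal illustration of how a consensus mask could be derived from the three separate annotations, the sketch below applies a majority vote (at least two of three raters), in line with the majority agreement described above; the collaborative refinement of uncertain regions remains a manual step and is not modelled here.

```python
import numpy as np

def consensus_mask(mask_a, mask_b, mask_c, min_votes=2):
    """Majority-vote consensus of three binary annotation masks (sketch)."""
    votes = (mask_a > 0).astype(np.uint8) + (mask_b > 0).astype(np.uint8) \
            + (mask_c > 0).astype(np.uint8)
    # Pixels kept in the consensus are those marked by at least `min_votes` raters.
    return (votes >= min_votes).astype(np.uint8)
```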

Fig. 2

Labelling tool developed in-house for chronic wound segmentation and characterisation.

Fig. 3

Examples of images and tissue masks from the Wounds dataset (red - granulation, green - slough, blue - eschar).

Two examples of the masks resulting from the specialists’ annotations are shown in Fig. 4. The level of agreement between them is depicted in different colours for the open wound (yellow) and each tissue (granulation - red, slough - green and eschar - blue). In each case, the strongest colour refers to pixels where the three annotators agreed, the middle shade represents pixels where two annotators were in agreement, and the softest colour represents pixels delineated by only one of the specialists.

Fig. 4

Illustrative examples of masks annotated by three experts. The yellow masks correspond to the open wound; red, green, and blue masks stand for granulation, slough and eschar tissues, respectively.

The Wounds dataset comprises 307 images from 104 wounds of different types, as detailed in Table 1. Most wounds correspond to pressure ulcers (59), but venous, arterial and diabetic foot ulcer images are also represented in the dataset. Among the pressure ulcers, all categories are present, with the majority belonging to categories 2, 3 and 4. The dataset contains images of all skin phototypes, with the majority from patients with phototypes 2 and 3. The dataset split into training (75%) and test (25%) sets, detailed in Supplementary Material S1, was designed to incorporate representative examples from diverse wound types, body locations and skin phototypes. Moreover, the proportion of images containing each type of tissue was maintained between the training and test sets. The dataset contains wounds with areas between 0.01 and 160 square centimetres (average area of 13.65 \(cm^2\), with a standard deviation of 21.78 \(cm^2\)), and with width and height up to 13 centimetres. In the original images, the open wound area covers up to 25% of the total image area. Statistics of the percentages of tissues inside the wound in the Wounds dataset are presented in Supplementary Material S1.

Table 1 Distribution of the Wounds dataset, with the number of wounds (#W) and images (#I) per wound type.

Moreover, the publicly available AZH FU dataset27 was also considered in our experiments, as a crucial component of the pretraining phase of our models. This dataset is composed of 1010 images from 889 patients, split in a proportion of 80:20 for training and testing, together with the corresponding ground truth wound segmentation masks.

Open wound and tissue segmentation agreement studies

Two distinct agreement analyses were conducted to assess the reliability and consistency of the wound and tissue segmentation masks provided by the experts. Firstly, we evaluated inter-rater agreement to measure the level of agreement among the three specialists. Secondly, we compared each rater individually against the consensus masks described previously and used in our experiments. To this end, the subset of 121 images from the dataset (Section "Dataset") that was annotated by the three specialists was utilised.

In both cases, the variability in boundary annotations was quantified using the agreement measure defined in Eq. (1), which corresponds to the IoU, where \(D_1\) and \(D_2\) refer to the regions annotated by different specialists in the first study and, in the second case, to the regions annotated by each specialist and the consensus.

$$\begin{aligned} IoU (Agreement) = \frac{|D_1 \cap D_2|}{|D_1 \cup D_2|} \times 100 \end{aligned}$$
(1)
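
For reference, a minimal implementation of Eq. (1) for a pair of binary masks could look as follows (NumPy arrays are assumed):

```python
import numpy as np

def iou_agreement(mask_a, mask_b):
    """Pairwise agreement (Eq. 1) between two binary masks, in percent."""
    intersection = np.logical_and(mask_a > 0, mask_b > 0).sum()
    union = np.logical_or(mask_a > 0, mask_b > 0).sum()
    return 100.0 * intersection / union if union > 0 else 0.0
```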

Additionally, to measure the annotation variability among the three raters and also in relation to the consensus, the Shrout and Fleiss intraclass correlation coefficient (ICC)35 was computed, employing a two-way mixed effects model, given that the raters assessed the same set of samples in the dataset. The ICC describes how strongly units in the same group resemble each other.
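
The ICC computation is illustrated below with the pingouin package as one possible implementation (an assumption for illustration, not necessarily the toolkit used in this study); the proportion values are placeholders, and the ICC3 row corresponds to the Shrout and Fleiss two-way mixed-effects, single-rater model.

```python
import pandas as pd
import pingouin as pg

# Long-format table: one row per (image, rater) pair with the estimated
# granulation proportion; the values below are placeholders for illustration.
ratings = pd.DataFrame({
    "image": ["img1"] * 3 + ["img2"] * 3 + ["img3"] * 3,
    "rater": ["R1", "R2", "R3"] * 3,
    "granulation_pct": [62.0, 58.5, 60.1, 35.2, 40.0, 37.8, 80.3, 77.9, 82.5],
})

icc = pg.intraclass_corr(data=ratings, targets="image",
                         raters="rater", ratings="granulation_pct")
print(icc.loc[icc["Type"] == "ICC3", ["Type", "ICC", "CI95%"]])
```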

The first agreement analysis (inter-rater agreement) comprised the agreement in wound and tissue mask annotations between each pair of raters and among all three, as well as the ICC obtained for the tissue proportion estimations. It is worth noting that, in this case, the proportion of each tissue was obtained by counting the pixels belonging to the corresponding tissue type relative to the sum of the three possible tissues (granulation, slough and eschar) within the wound bed, even though the open wound may contain other tissues. Regarding the second agreement study (raters’ agreement with consensus), the agreement of each rater’s annotations with the ground truth masks was computed, and the ICC was measured considering the IoU between the tissue masks annotated by each rater and the consensus.

Tissue segmentation

To automate the identification of the various tissues within the wound bed and determine their respective proportions, a tissue segmentation model was developed. This development comprised two parts. The first aimed to evaluate the feasibility of the proposed methodology; thus, GT annotations of the open wound, in terms of bounding boxes and segmentation masks, were used to avoid error propagation at each step. These annotations were used, respectively, in the pre-processing of the images to crop the wound region of interest (ROI) (Section "Pre-processing") and in the post-processing of the tissue masks (Section "Post-processing"). In the second part, instead of using these GT annotations, previously developed wound detection and wound segmentation models were incorporated into a streamlined pipeline, illustrated in Fig. 1, for seamless deployment in the real world. Moreover, in both parts, the previously trained wound segmentation models were used as a baseline for the fine-tuning process of the tissue segmentation models.

Pre-processing

Before entering the tissue segmentation models, the images were standardised through a pre-processing step consisting of a cropping operation centred on the wound region. As previously stated, the experiments comprised two phases. To obtain the region of interest around the wound, in the first phase, the GT segmentation masks were used to extract the bounding boxes of the wound and reference marker, preventing error propagation during model development. In contrast, in the streamlined pipeline, these bounding boxes were obtained through a detection model based on RetinaNet with a MobileNetV2 backbone33, which reported mAP@0.75 IoU values of 0.39 and 0.95 for wound and marker recognition, respectively. In the Wounds dataset, a padding margin equal to 25% of the reference marker’s largest side was then applied to the cropped images to preserve contextual information, whereas, in the case of the AZH FU dataset, a tolerance of 30% of the bounding box size was considered. The two padding percentages were determined empirically. For both datasets, the cropped region was then enforced to a square shape with dimensions of \(320\times 320\) pixels for consistency, and pixel intensity normalisation was applied.
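
A simplified sketch of this pre-processing step is shown below, assuming bounding boxes in (x0, y0, x1, y1) pixel coordinates and a plain division by 255 standing in for the actual intensity normalisation.

```python
import cv2
import numpy as np

def crop_wound_roi(image, wound_box, marker_box, out_size=320, pad_frac=0.25):
    """Square crop around the wound, padded by 25% of the marker's largest side."""
    x0, y0, x1, y1 = wound_box
    mx0, my0, mx1, my1 = marker_box
    pad = int(pad_frac * max(mx1 - mx0, my1 - my0))
    x0, y0 = max(x0 - pad, 0), max(y0 - pad, 0)
    x1, y1 = min(x1 + pad, image.shape[1]), min(y1 + pad, image.shape[0])
    # Enforce a square region centred on the padded wound box.
    side = max(x1 - x0, y1 - y0)
    cx, cy = (x0 + x1) // 2, (y0 + y1) // 2
    sx, sy = max(cx - side // 2, 0), max(cy - side // 2, 0)
    crop = image[sy:sy + side, sx:sx + side]
    crop = cv2.resize(crop, (out_size, out_size))
    return crop.astype(np.float32) / 255.0  # placeholder intensity normalisation
```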

Implementation details

The study on tissue segmentation in chronic wounds encompassed both CNN and transformer-based models using different training approaches. The investigated models were based on the DeepLabV3+ architecture36, with a ResNet5037 backbone pre-trained on ImageNet38 (DeepLabV3-R50), and on the SegFormer architecture39 (SegFormer-B0). While DeepLabV3+ uses a CNN with an encoder-decoder design for segmentation, SegFormer uses lightweight transformers to capture global context without positional encodings. Specifically, we compared models trained only for tissue segmentation with models pre-trained for open wound segmentation and subsequently fine-tuned for our task. Regarding the wound segmentation models, besides being trained on the Wounds dataset, the impact of using other datasets during their training was investigated, so the AZH FU dataset was also employed. Therefore, in our work, a total of four models concerning open wound segmentation were explored as pre-trained models for tissue segmentation, namely DeepLabV3+ and SegFormer-based models trained either only on the Wounds dataset or on both the Wounds and AZH FU datasets. These models are described in detail in32 and their performance on the Wounds dataset is reported in Table 2.

Table 2 Open wound segmentation results (in %) on the Wounds dataset, using models developed in32.
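
While the models in this study were implemented with MMSegmentation, a rough, illustrative instantiation of comparable architectures using two widely available open-source libraries is sketched below; the checkpoint names, the class count (background plus the three tissues) and the libraries themselves are assumptions for illustration, not the authors' exact configuration.

```python
import segmentation_models_pytorch as smp
from transformers import SegformerForSemanticSegmentation

NUM_CLASSES = 4  # background, granulation, slough, eschar (assumed encoding)

# CNN branch: DeepLabV3+ with a ResNet-50 encoder pre-trained on ImageNet.
deeplab = smp.DeepLabV3Plus(
    encoder_name="resnet50",
    encoder_weights="imagenet",
    in_channels=3,
    classes=NUM_CLASSES,
)

# Transformer branch: SegFormer-B0 (MiT-B0 encoder) with a freshly initialised head.
segformer = SegformerForSemanticSegmentation.from_pretrained(
    "nvidia/mit-b0", num_labels=NUM_CLASSES
)
```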

To optimise the model configuration, a grid search hyperparameter tuning process combined with a stratified 3-fold cross-validation procedure was employed. For training, a maximum of 200 epochs was established, using early stopping with a patience of 10 epochs. Considering the impact on computational demands, two image dimensions (\(224\times 224\) and \(320\times 320\) pixels) were explored, adopting batch sizes of 16 and 32. Moreover, the Adam optimiser with learning rate (LR) values of \(10^{-4}\) and \(10^{-3}\) was considered. With respect to the loss functions, for the DeepLabV3-R50 model, a combination of Dice loss and Cross Entropy (with equal weights) was employed, while for SegFormer-B0 only Cross Entropy was used. Both training and evaluation of the models were performed using the open-source semantic segmentation toolbox MMSegmentation v1.2.140 on PyTorch v1.13.1+cu116. The experiments used a workstation with four NVIDIA Tesla A100 and V100 GPUs.
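
A minimal re-implementation of the equally weighted Dice plus cross-entropy objective used for DeepLabV3-R50 is sketched below; the exact MMSegmentation loss configuration may differ.

```python
import torch
import torch.nn.functional as F

def dice_ce_loss(logits, target, num_classes=4, eps=1e-6):
    """Equally weighted sum of cross-entropy and soft Dice loss.

    logits: (N, C, H, W) raw scores; target: (N, H, W) integer labels.
    """
    ce = F.cross_entropy(logits, target)
    probs = torch.softmax(logits, dim=1)
    one_hot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    dims = (0, 2, 3)
    intersection = (probs * one_hot).sum(dims)
    cardinality = probs.sum(dims) + one_hot.sum(dims)
    dice = 1.0 - ((2.0 * intersection + eps) / (cardinality + eps)).mean()
    return ce + dice  # equal weighting of the two terms
```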

Post-processing

In order to obtain the final tissue masks, we explored the effect of intersecting the masks predicted by the tissue segmentation models with the open wound masks, hence eliminating possible artefacts outside the segmented wound. In the first case, the GT open wound masks annotated by the specialists were considered, whereas in the full pipeline, the SegFormer-B0 wound segmentation model trained exclusively on the Wounds dataset (Dice 91.55% - Table 2) was employed to obtain the wound masks. Finally, the percentage of each tissue inside the open wound, relative to the sum of all three tissues, was calculated.
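
The intersection and proportion computation can be sketched as follows, assuming integer-coded predictions (0 = background, 1 = granulation, 2 = slough, 3 = eschar; this label encoding is an assumption for illustration).

```python
import numpy as np

def refine_and_quantify(tissue_pred, wound_mask):
    """Keep tissue predictions inside the open wound and compute percentages."""
    refined = np.where(wound_mask > 0, tissue_pred, 0)
    counts = np.array([(refined == c).sum() for c in (1, 2, 3)], dtype=float)
    total = counts.sum()
    percentages = 100.0 * counts / total if total > 0 else np.zeros(3)
    return refined, percentages  # granulation, slough, eschar (in %)
```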

Evaluation details

An ablation study was conducted to evaluate the influence of different training processes and processing strategies on the performance of the tissue segmentation model. Different architectures were compared, namely a CNN (DeepLabV3-R50) and a transformer-based model (SegFormer-B0). Additionally, the study investigated the effect of fine-tuning a model initially trained for open wound segmentation, as well as the influence of the different datasets utilised for pretraining the models. The assessment also included an investigation into the impact of the described post-processing operations.

To assess the segmentation performance and compare it with other state-of-the-art approaches, we used the IoU (Eq. (1)) and the Dice coefficient (Eq. (2)), where \(D_1\) represents the consensus (GT) mask and \(D_2\) the model’s prediction. The IoU quantifies the overlap between the predicted and GT segmentation masks and ranges from 0 (no overlap) to 1 (perfect overlap); the Dice coefficient is a measure of the similarity between two masks, with values closer to 1 indicating higher agreement.

$$\begin{aligned} Dice = \frac{2 | D_1 \cap D_2 |}{|D_1|+|D_2|} \end{aligned}$$
(2)

To evaluate the estimated tissue proportions, the mean absolute error (MAE) was adopted. MAE is defined in Eq. (3), with n representing the total number of samples, and \(y_i\) and \({\hat{y}}_i\) the true and predicted values of the i-th sample, respectively.

$$\begin{aligned} MAE = \frac{1}{n} \sum _{i=1}^{n} |y_i - {\hat{y}}_i| \end{aligned}$$
(3)
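
For completeness, straightforward implementations of Eqs. (2) and (3) are given below (binary masks and proportion arrays as NumPy inputs are assumed):

```python
import numpy as np

def dice_score(gt_mask, pred_mask):
    """Dice coefficient (Eq. 2) between a consensus mask and a prediction."""
    intersection = np.logical_and(gt_mask > 0, pred_mask > 0).sum()
    total = (gt_mask > 0).sum() + (pred_mask > 0).sum()
    return 2.0 * intersection / total if total > 0 else 1.0

def mean_absolute_error(y_true, y_pred):
    """MAE (Eq. 3) between true and estimated tissue proportions."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.abs(y_true - y_pred).mean()
```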

Results and discussion

Open wound and tissue segmentation agreement studies

The findings of the inter-rater and rater/consensus agreement analysis are described in this section. This analysis established a baseline for human performance on the task, providing a benchmark to compare the effectiveness of the proposed tissue characterisation framework. Firstly, the inter-rater variability was examined to assess the consistency among the experts; then, the results of the alignment between each expert and the consensus ground truth were evaluated, to understand the accuracy of the manual annotations.

Inter-rater agreement

Table 3 provides an analysis of the open wound and tissue boundary agreement among each pair of experts and among all three raters. Regarding open wound annotations, the mean agreement between any two raters ranges from 78.5% to 81.2%, declining to 71.5% when all three raters are considered. Notably, specialists 1 and 3 exhibit greater alignment, demonstrating the highest mean, minimum, and maximum agreement values. This trend is reflected in the tissue agreement. Upon examination of the mean and standard deviation values, the same pair of experts demonstrates the highest agreement for slough and eschar. In terms of granulation, a similar mean agreement is observed between pair 1 (raters 1 and 2) and pair 2 (raters 1 and 3), with a difference of 0.5%. However, pair 1 exhibits a lower standard deviation compared to pair 2, with a difference of 1.8%. Overall, agreement at the tissue level is considerably lower, with average values mostly between 50% and 60% and standard deviations above 20%. Among the three tissue types, necrotic tissue appears to be the most easily identifiable due to its strong colour and well-defined contours, while granulation and slough present comparable levels of difficulty. However, it is important to note that the number of images containing eschar is considerably lower, which could potentially influence this observation. Notably, analysis of the minimum agreement values reveals cases with 0% agreement for granulation when rater 2 is involved, indicating that this specialist annotated regions with no overlap with those delineated by the other raters. Figure 5 shows examples of cases with low inter-rater agreement for each tissue.

Table 3 Inter-rater agreement (%) for wound and tissue segmentation across rater pairs and the complete set.
Fig. 5

Examples of samples with low inter-rater agreement. The masks delineated by the raters for open wound (yellow), granulation (red), slough (green) and eschar (blue) report agreement values of 4.98%, 8.86%, 0.16% and 0.08%, respectively.

The examination of tissue proportions within the wound bed also offers insights into inter-rater variability, captured by the ICC. For the calculation of this statistic, only images containing annotations from the three specialists were considered for each tissue. There is excellent agreement among the experts41, with granulation and eschar tissues exhibiting the highest level of consensus (0.931 and 0.932, respectively), while for slough it is slightly lower (0.910), though still classified as excellent agreement.

Compared with previous studies10, where ICC values of 0.765, 0.591 and 0.759 were reported for granulation, slough and eschar, respectively, our study indicates a higher level of consistency in expert annotations, surpassing those by approximately 0.17, 0.32, and 0.17, respectively. Similarly, when compared to another inter-reader variability study13, our findings indicate a higher level of agreement among experts regarding tissue annotations in a larger proportion of images. Specifically, when considering all three raters, the study in13 reports average agreement values of 17.8% and 24.5% for slough and eschar, respectively, largely contrasting with our values of 43% and 63%. Conversely, the agreement values for granulation are very similar between the two studies. Regarding open wound boundary delineation, the agreement among experts in13 is slightly higher than that in Table 3, with higher minimum values and a lower standard deviation. The higher tissue agreement observed in our study may be attributed to the fact that our specialists are all nurses, albeit with varying years of experience. In contrast, other studies10,13 involve healthcare professionals with diverse backgrounds, potentially contributing to discrepancies in their annotations. In addition, previous studies10,13 were based on only around 50 images, whereas our study analysed the agreement based on 121 images. This lends greater confidence to our proposed approach, as robust ground truth data are important for training accurate and reliable segmentation models.

Raters agreement with consensus

The results concerning each rater’s agreement with the GT (consensus) annotations of the wound and tissues may be found in Table 4. As we are comparing the masks annotated by each rater with the corresponding consensus annotation, only images where both the expert and the consensus had annotations for a specific class (wound or tissue) were considered; therefore, a different number of supporting images is found for each rater. In terms of wound annotations, we observe high agreement between the masks delineated by each rater and the generated consensus mask (GT), with values above 80% for the three specialists. Overall, as was the case in the previous analysis (Section "Inter-rater agreement"), although the agreement between clinicians and consensus at the wound level was good, their agreement at the tissue level decreased, with average values ranging from around 57.9% to 79.6%, reflecting the inherent complexity of the problem. Comparing the agreement between each of the three experts and the GT (consensus) masks, we found that both for open wound annotations and for tissue annotations, except for eschar tissue, rater 1 achieved the highest mean, minimum, and maximum agreement values with the consensus masks, showing the greatest proximity to the GT masks used in our experiments. In terms of tissues, eschar also achieved the highest agreement, which, apart from the smaller number of images, may result from the easier colour-based differentiation of its boundaries.

Table 4 Rater agreement with consensus segmentation for open wound and tissues (%).

The agreement between each expert’s masks and the consensus (GT) masks was also quantified using the ICC applied to the IoU values for each tissue type. The high ICC values obtained for granulation (0.803) and slough (0.883) tissues demonstrate good reliability of the experts’ annotations relative to the consensus. Nevertheless, for eschar tissue, we find an opposite trend to that seen in Table 4, as, in this case, this tissue reveals the lowest consistency, although a moderate ICC value is still observed (0.581). Besides inter-rater variance, the ICC also measures how consistent the IoU values with the consensus masks are across raters. Therefore, as a lower number of images was available for this tissue, the estimations were more prone to fluctuations that may have impacted the ICC calculations but were not reflected in the average agreement values found in Table 4.

Overall, the results for the inter-rater and rater/consensus agreement analysis demonstrate the variability in visual estimation, reflecting the tissue differentiation challenge. To reduce ambiguity and provide objectivity in this task, a framework for deep learning–based tissue segmentation and proportion estimation was presented and its results are discussed next.

Tissue segmentation

The results of the model optimisation experiments are presented first, followed by an ablation study to select the best architectures, fine-tuning methods and post-processing operations, and to assess their impact on the performance of the proposed framework regarding tissue segmentation and proportion estimation.

Hyperparameter tuning

To determine the optimal hyperparameter configuration for each model, we compared the performance in terms of the IoU metric achieved across experiments using different combinations of image sizes (\(224\times 224\) and \(320\times 320\)), batch sizes (16 and 32), and learning rates (\(10^{-4}\) and \(10^{-3}\)). The average cross-validation results are depicted in Supplementary Material S2. The most effective hyperparameter combination for each model (highlighted therein) was chosen for the ablation study.

Ablation study

We performed an ablation study to investigate the impact on the overall performance of different architectures, of fine-tuning models pre-trained on a simpler task from the same domain with different datasets, and of post-processing operations. The results are presented in Table 5 and demonstrate the effectiveness of different training strategies in improving the segmentation accuracy of chronic wound tissues. Open wound segmentation models fine-tuned for tissue segmentation consistently outperformed the others across all classes. Leveraging prior knowledge in open wound segmentation has proven advantageous across all tissue types. By initially training the model on this specific domain, which presents a relatively simpler task, we provided the model with a solid foundation. Subsequently, this knowledge was successfully applied to tackle the more intricate challenge of segmenting the different regions within the wound. Furthermore, the effectiveness of the post-processing step, wherein tissue predictions are intersected with the open wound mask, is also confirmed, as this approach positively influenced the results for each tissue class.

By inspecting Table 5, the DeepLabV3-R50 model pre-trained for open wound segmentation and fine-tuned for tissue segmentation with the Wounds dataset, with an input size of \(320\times 320\), batch size of 32 and LR equal to \(10^{-4}\), emerges as the top-performing approach, yielding the highest mIoU and mDice scores: 62.95% and 76.82%. Despite the limited representation of necrotic tissue within the dataset, the model achieved a Dice score of approximately 83% for this class. Granulation exhibited a performance of 81.33% for the same metric. These outcomes are not surprising, given that both components typically exhibit well-defined boundaries. Slough emerged as the most challenging category, with the model returning a Dice of 66.13%, while none of the other models attained values above 68%.

The performance of our best model falls short of the AutoTissue approach proposed in10, which achieves a mIoU of 71.92%. The works in16 and17 employed superpixel-based approaches, dividing the image into fixed regions for tissue classification. The first study reported an accuracy of 92.68% and a Dice of 75.74%, while the latter achieved precision, recall and accuracy scores of 99.15%, 99.15% and 99.57%, respectively. Moreover, the authors in17 excluded images where annotators could not reach a consensus. In20 and19, the authors also performed tissue classification in patches of \(5\times 5\) pixels, with the former yielding an accuracy of 99.55%, specificity of 98.06% and sensitivity of 95.66%, and the latter reporting a classification accuracy of 92.01% and an average total weighted Dice of 91.38%. These fixed-region methodologies reduce the error compared to our proposed pixel-wise approach. Furthermore, the results obtained by these approaches may not be directly comparable, as each method is tailored to the specific characteristics and requirements of its own dataset and is evaluated on that particular dataset. Variables such as image resolution, diversity of wounds, and annotation protocols can vary between datasets, influencing the performance of the algorithms. Therefore, while these approaches may achieve impressive results within their respective datasets, their applicability and generalisability to other datasets, including ours, may be limited.

Overall performance of the proposed framework

Table 5 presents the results obtained for the complete pipeline, thus incorporating both the detection and segmentation models. Since the open wound detection model failed to detect wounds in a number of samples, only 65 out of the initial 78 images contained in the test set were utilised for segmentation evaluation. With the exception of the SegFormer-B0 model pre-trained for open wound segmentation on the Wounds dataset, we can find a positive impact of incorporating domain knowledge regarding a simpler task (open wound segmentation) when training our models, as the fine-tuned models exhibit higher performance than the models trained only for tissue segmentation. Similarly to the results verified in Table 5 concerning the GT-based pre-processing approach, the DeepLabV3-R50 model pre-trained for open wound segmentation on the Wounds dataset emerged as the top-performing model, achieving the highest mIoU and mDice scores. The combined approach of detection and segmentation demonstrated notable robustness, with only a minor 2.5% reduction in overall segmentation performance for the best-performing model (mDice decreasing from 76.82% to 74.38%) relative to the results in Table 5. Moreover, comparing the outcomes for each tissue, the trend persists, with slough tissue being the most difficult to identify due to its more challenging colouring, presenting the lowest Dice score (64.67%). In this case, granulation and eschar tissues achieved similar results, corresponding to Dice scores of 79.45% and 79.01%, respectively.

Table 5 Results (%) of tissue segmentation on the test set of the Wounds dataset for the ablation study and proposed framework approaches. Wound Inters. denotes whether the predicted tissue masks are intersected with the corresponding open wound masks.

Statistical comparisons were performed on the two top-performing models in terms of mean Dice, namely the two DeepLabV3-R50 models pre-trained for open wound segmentation and fine-tuned for tissue segmentation. Both models were initially trained with the Wounds dataset, with one additionally pre-trained using the AZH FU dataset. The comparisons were conducted at a significance level of 95% (\(p<0.05\)). To identify the most appropriate test, the normality of the IoU and Dice distributions was assessed using the Shapiro-Wilk test. Given the non-normality of the data and the paired nature of the samples, the Wilcoxon signed-rank test was employed, and no significant difference between the models under consideration was found. To further compare the performance of the models, a visual inspection was conducted, as well as an analysis of the impact of their corresponding predicted tissue masks on the final goal of the framework: tissue proportion estimation.
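
The statistical procedure can be reproduced with SciPy as sketched below; the per-image Dice values are placeholders, since only the test protocol (Shapiro-Wilk for normality, followed by a paired Wilcoxon signed-rank test at the 5% level) is taken from the text.

```python
import numpy as np
from scipy import stats

# Placeholder per-image Dice scores of the two fine-tuned DeepLabV3-R50 models
# evaluated on the same test images (paired samples).
dice_model_a = np.array([0.78, 0.81, 0.66, 0.90, 0.72, 0.85])
dice_model_b = np.array([0.75, 0.83, 0.60, 0.88, 0.70, 0.84])

# Normality check of each score distribution (Shapiro-Wilk).
p_norm_a = stats.shapiro(dice_model_a).pvalue
p_norm_b = stats.shapiro(dice_model_b).pvalue

# Paired, non-normal data: Wilcoxon signed-rank test at the 5% significance level.
p_wilcoxon = stats.wilcoxon(dice_model_a, dice_model_b).pvalue
print(f"Shapiro p-values: {p_norm_a:.3f}, {p_norm_b:.3f}; "
      f"Wilcoxon p-value: {p_wilcoxon:.3f}; significant: {p_wilcoxon < 0.05}")
```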

Examples of the predictions from the top-performing models are depicted in Fig. 6, where, for simplicity’s sake, the DeepLabV3-R50 model pre-trained for open wound segmentation on the Wounds dataset is referred to as “model A” and the DeepLabV3-R50 model pre-trained for open wound segmentation on both the AZH FU and the Wounds datasets corresponds to “model B”. The top row showcases satisfactory predictions, with mDice values of 66.75% vs 56.32%, and 92.21% vs 86.25% for models A and B, respectively. In the middle row, segmentations with mean Dice scores of 31.40% vs 64.61%, and 52.3% vs 49.43% are presented. Lastly, the bottom row displays flawed instances, with mDice of 56.14% vs 65.88%, and 54.78% vs 26.37%. These examples show the importance of visual examination. Despite the higher mean Dice scores in the last row, predictions in the middle row appear more sensible: the example on the left contains minor misclassifications of eschar as the other tissues, which penalise the overall result; on the right, there is granulation tissue misidentified as slough by both models, a reasonable error given its lighter colour, and the necrotic tissue is not recognised within the wound bed, leading to a low Dice score for this class. The bottom examples illustrate poor predictions: the left sample displays excessive slough segmentation in both models, likely attributable to the lighter tone caused by reflections. Additionally, there is minimal tissue identification in the centre of the wound in the example on the right, and model B completely misidentifies the granulation and necrotic tissues. Comparing the two approaches, model A not only achieves higher mDice scores in the provided examples but also consistently delivers more visually coherent results. The superior performance of model A implies that pretraining on a domain-specific dataset (the Wounds dataset) is more beneficial than combining it with additional datasets (AZH FU) that may introduce variability without significant benefit.

Fig. 6

Examples of tissue predictions generated by the pipeline for the test set images. The first row demonstrates accurate predictions, while the second row illustrates cases where predictions are understandable (upon visual examination) but not optimal. The third row depicts instances of unsatisfactory predictions. Red represents granulation, green slough and blue eschar. Model A stands for the DeepLabV3-R50 model pre-trained for open wound segmentation on the Wounds dataset and model B for the DeepLabV3-R50 model pre-trained for open wound segmentation on the AZH FU and Wounds datasets.

The MAE of the tissue proportions calculated from the tissue masks generated by the DeepLabV3-R50 tissue segmentation models, previously trained for open wound segmentation on the AZH FU and/or Wounds datasets, was computed for the test images, as shown in Table 6. Model A, pre-trained solely on the Wounds dataset, consistently demonstrated lower errors for all tissue types, as well as lower standard deviations, indicating reduced variability compared to model B. This further supports the conclusion drawn from the visual examination.

No reports of the MAE for tissue proportion estimation were found in the literature, preventing direct comparisons of our results. A valuable follow-up study would involve collecting visually estimated wound tissue proportions from specialists in real-world settings, hence evaluating the reproducibility of the automated approach compared to expert assessment.

Table 6 Mean absolute error of tissue proportions obtained by the pipeline in the test set. The models compared are the DeepLabV3-R50 tissue segmentation models, pre-trained for open wound segmentation on the AZH FU and/or Wounds datasets.

Conclusions

This work demonstrates the potential of an automated approach for chronic wound tissue segmentation and tissue proportion estimation by leveraging deep learning and domain knowledge. This study introduces a novel and comprehensive tissue segmentation dataset, encompassing a wide range of wound types and appearances. Furthermore, an inter-rater agreement analysis confirmed the high annotation quality of the constructed dataset, while also highlighting the inherent complexity of tissue delineation and providing important context for interpreting the models’ performance. Using this dataset, and after exploring different training conditions, a DeepLabV3-R50 model first pre-trained for a simpler task (open wound segmentation) and then fine-tuned for tissue segmentation on a private Wounds dataset was selected to predict the tissue regions and subsequently estimate their percentage within the open wound. This resulted in Dice scores of 79.45%, 64.67% and 79.01% for granulation, slough and eschar tissues, respectively, and MAE values for proportion estimation of (\(14.33\pm 16.05\))%, (\(14.31\pm 15.28\))% and (\(8.84\pm 5.29\))% for the same tissues. The impact of the explored conditions on the final performance emphasises the importance of considering different factors, including network architecture, fine-tuning strategies, dataset diversity and post-processing operations, in optimising the performance of the models.

Future work will focus on clinical validation and exploring a data-centric approach to improve the quality of the considered dataset by identifying images with lower quality and under-represented sample types, ensuring a more representative data coverage.