Introduction

The transition from traditional microscopy to Whole Slide Imaging (WSI) has revolutionized pathology, enabling the quantitative study of pathological tissues through advanced computer vision techniques applied to high-resolution digital slides1,2. This technological leap has laid the foundation for integrating artificial intelligence (AI) into liver pathology, with the promise of standardizing tissue analyses and enhancing research through precise assessment of morphological elements and improved integration with ancillary methods such as in situ -omics3. Despite the increasing application of AI in liver pathology (e.g., fibrosis and metabolic dysfunction-associated steatotic liver disease, MASLD), rare inflammatory diseases with significant clinical burden remain largely underexplored4,5. Histological evaluation plays a crucial role in the diagnostic workflow of these conditions and requires detailed assessment of inflammation, considering its localization and spread across specific compartments6.

Although some commercial applications exist for dividing liver tissue into its main components, their datasets and intended uses are often restricted to specific diseases, and they usually lack external validation7. Moreover, there is a lack of reliable open-source options that would enable pathologists and researchers to perform automated zonation of liver tissue8. This capability could unlock various applications: beyond studying the localization of inflammatory infiltrates, it extends to tasks such as data coregistration across advanced methodologies through automated annotation, and targeted region selection for studying specific compartments, thereby saving computational resources9. Combined with other promising computer vision techniques, such as the weakly supervised classification methods demonstrated in our previous research, these strategies offer significant potential for advancing tissue analysis10.

This study aims to bridge this gap by developing an AI-based approach for automating the segmentation and semantic recognition of portal and lobular compartments in liver biopsies.

Results

Cases

A total of 223 cases were collected for training the model, encompassing a diverse range of conditions. The largest group consisted of cases classified as autoimmune hepatitis (AIH, 85 cases, 38%), followed by primary biliary cholangitis (PBC, 54 cases, 24%), normal or near-normal liver parenchyma (25 cases, 11%), drug-induced liver injury (DILI, 16 cases, 7%), MASLD (15 cases, 7%), graft-versus-host disease (GVHD, 10 cases, 5%), primary sclerosing cholangitis (PSC, 10 cases, 5%), infectious disease (5 cases, 2%), and iron overload disorder (3 cases, 1%). The sex distribution within the study was nearly balanced, with females constituting 59.64% (132 cases) and males 40.36% (90 cases) of the total. The mean age of participants was 50.2 years (median 50), ranging from 8 to 85 years. The test set comprised 16 AIH cases and eight PBC cases from five institutions external to the training and validation datasets, along with five GVHD cases from one internal institution. After tile extraction, the dataset comprised 54,338 patch pairs at 1 μm/px magnification. Following an 80/20 split, 43,470 patches were allocated for training and 10,868 for validation. At 2 μm/px magnification, the dataset contained a total of 28,197 patches, which were similarly divided into 22,558 for training and 5639 for validation. The test set included 8789 patches at 1 μm/px magnification and 4611 patches at 2 μm/px magnification. In the training and validation sets, the dataset composition was ~17% background, 71% hepatic lobule regions, and 12% portal areas. In the test set, the corresponding proportions were ~28% background, 61% hepatic lobule regions, and 11% portal areas.

Model development

Training was scheduled for up to 100 epochs, with early stopping triggered at epoch 49 based on the validation loss, which avoided unnecessary training and reduced computational costs (Fig. 1). The overall training time was 10.6 h, and the average time per epoch was 13.1 min. The resulting model, called “Hematoxylin & Eosin-based Open-access Tool for Segmentation of Portal Tracts” (HOTSPoT), achieved a mean Dice coefficient of 0.92, 0.92 and 0.91 and a mean IoU of 0.86, 0.85 and 0.84 on the training, validation and test sets, respectively. Detailed performance metrics are provided in Table 1, which presents the mean and per-class Dice coefficient and IoU values. Segmentation performance across the test set is further elaborated in Table 2. The average inference time per WSI in the test set was 1.3 s, and inference on the full test set took 38 s, indicating scalability to larger datasets. The average time for inference and deployment in QuPath was 20 s (range 5 s–1.2 min). Variations in time were primarily influenced by the preprocessing steps of the tissue finder, which analyzed the entire slide to handle cases with dark or noisy backgrounds effectively. The QuPath implementation allows graphic visualization of the automatic segmentation process, as shown in Fig. 2.
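The early-stopping rule described above can be sketched as follows. This is a minimal illustration rather than the training code used for HOTSPoT: the `patience` and `min_delta` values, the class and function names, and the example loss sequence are all hypothetical, since the text states only that validation loss was monitored.

```python
class EarlyStopping:
    """Stop training when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience      # hypothetical value, not reported in the text
        self.min_delta = min_delta    # minimum decrease that counts as an improvement
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


def run_with_early_stopping(val_losses, patience=5):
    """Return the 1-based epoch at which training stops for a given loss sequence."""
    stopper = EarlyStopping(patience=patience)
    for epoch, loss in enumerate(val_losses, start=1):
        if stopper.step(loss):
            return epoch
    return len(val_losses)  # the epoch budget was exhausted without triggering
```

With a budget of 100 epochs, a rule of this kind ends training as soon as the validation loss plateaus, which is how a run can finish well before the scheduled maximum.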

Fig. 1: Training process of the HOTSPoT model.

The training curves show the model’s IoU and Dice coefficient over the epochs, with steady improvement in both training and validation performance. IoU Intersection over Union.

Fig. 2: Graphic visualization of the automatic segmentation process.

Tile-level segmentations inferred by the model were automatically imported into QuPath as GeoJSON objects for image visualization and analysis. On the left, biopsy of a patient with cirrhotic liver in AIH. On the right, biopsy of a non-cirrhotic patient with quiescent PBC.

Table 1 Model performance on training, validation and test sets overall and divided by compartment of interest
Table 2 Summary of test set performance metrics across different institutions and diseases

Upon completion of training, the model was exported as a fully traced and weight-optimized TorchScript file for efficient deployment and inference. HOTSPoT was implemented in Python using the PyTorch 2.3 framework. The source code, together with the fully traced TorchScript model, is available in the GitHub repository at the following URL: https://github.com/Gizmopath/HOTSPoT.

A modified version of the existing WSInfer library, designed for the deployment of WSI-level models in TorchScript format, was implemented and used for slide-level inference11. This library was built and used in previous projects and can be easily adapted to any segmentation project12. Inference is supported at the WSI level, with model predictions converted into GeoJSON files, an open standard for representing geographical features, which can be imported into QuPath. The source code required to run inference at the WSI level using a pixel-based model is provided in the ReverseQuPath folder at the following URL: https://github.com/Vsc0/nutshell.
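To illustrate the GeoJSON export step, the sketch below writes predicted regions as a GeoJSON FeatureCollection. It is a simplified illustration and not the code in the repositories above: regions are reduced to square tiles rather than pixel-level contours, the function names are hypothetical, and the `objectType` and `classification` property keys are assumed to follow the convention QuPath uses for annotation objects.

```python
import json


def tile_to_feature(x, y, size, class_name):
    """Build a GeoJSON Feature for a square tile whose top-left corner is (x, y)."""
    # GeoJSON polygon rings must be closed: the first and last points coincide.
    ring = [
        [x, y],
        [x + size, y],
        [x + size, y + size],
        [x, y + size],
        [x, y],
    ]
    return {
        "type": "Feature",
        "geometry": {"type": "Polygon", "coordinates": [ring]},
        "properties": {
            "objectType": "annotation",              # assumed QuPath convention
            "classification": {"name": class_name},  # assumed QuPath convention
        },
    }


def tiles_to_geojson(tiles):
    """Serialize (x, y, size, class_name) tuples as a GeoJSON FeatureCollection."""
    features = [tile_to_feature(*tile) for tile in tiles]
    return json.dumps({"type": "FeatureCollection", "features": features})
```

Coordinates are expressed in pixels of the full-resolution image, so tiles predicted at a lower resolution would need to be scaled by the corresponding downsample factor before export.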

HOTSPoT clinical deployment: biopsy adequacy assessment

The clinical validation of the model included a total of 35 liver biopsy cases. The clinical diagnoses comprised autoimmune hepatitis (AIH, n = 4), drug-induced liver injury (DILI, n = 8), metabolic dysfunction-associated steatohepatitis (MASH, n = 6), primary biliary cholangitis (PBC, n = 2), ductal plate malformations (n = 3), primary sclerosing cholangitis (PSC, n = 3), and vascular/sinusoidal diseases (n = 6). Among these, 26 patients presented with a Ludwig fibrosis score of less than 3 at the time of diagnosis; inference on whole-slide images yielded a mean of 16.96 ± 6.08 portal tracts per case, showing a very high level of agreement with both observers (κ = 0.90 with original diagnosis and κ = 0.80 with the second observer).

The mean total biopsy area across the 35 cases was 19.56 ± 7.88 mm² and the mean manually measured length was 1.94 cm; the two measures were strongly correlated (r = 0.92, p < 0.001). Portal area percentages quantified by HOTSPoT on the total tissue area (mean 0.13 ± 0.14%) differed significantly across the five Ludwig fibrosis stages (0–4), as shown in Fig. 3, and correlated with histological fibrosis stage (Pearson’s r = 0.87, p < 0.001), suggesting that HOTSPoT quantification may serve as a useful surrogate for immediate “macroscopic” fibrosis assessment without special stains.
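The area measurements above follow from simple pixel arithmetic, sketched here under stated assumptions: tissue is taken as the sum of lobular and portal pixels, the segmentation is available at a known resolution in μm/px, and the function name and example pixel counts are hypothetical.

```python
def tissue_metrics(lobule_px, portal_px, mpp=1.0):
    """Return (tissue area in mm², portal area as % of tissue) from pixel counts.

    `mpp` is the resolution in micrometres per pixel, so one pixel covers
    mpp * mpp square micrometres, and 1 mm² equals 1,000,000 μm².
    """
    tissue_px = lobule_px + portal_px
    pixel_area_um2 = mpp * mpp
    tissue_area_mm2 = tissue_px * pixel_area_um2 / 1_000_000
    portal_pct = 100.0 * portal_px / tissue_px if tissue_px else 0.0
    return tissue_area_mm2, portal_pct
```

For example, 18,000,000 lobular pixels and 2,000,000 portal pixels at 1 μm/px correspond to 20 mm² of tissue, of which 10% is portal; the same tissue area is covered by 5,000,000 pixels at 2 μm/px.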

Fig. 3: Portal area quantification by HOTSPoT correlates with Ludwig fibrosis stages.

Box plots showing the percentage of biopsy area occupied by the portal region relative to the total tissue area, stratified by histological fibrosis class. A strong positive correlation between portal area percentage and fibrosis class was observed (Pearson’s r = 0.87, p < 0.001).

Discussion

The segmentation of portal areas in the liver has become an area of increasing interest in AI-based research, due to its potential to automate tasks that require complex manual annotations13. One of the key goals of this study was to develop a model capable of generalizing across a wide range of liver biopsy samples, obtained from different institutions and scanned with various devices14. HOTSPoT’s generalizability is demonstrated by its performance on test sets from multiple centers, which used distinct scanners, magnifications, and pre-analytical protocols, suggesting that its architecture is also resilient to the geographical diversity of the dataset. The scalability of HOTSPoT can be partially attributed to its transformer-based architecture, which is known for its ability to handle large, complex datasets and retain contextual information across various regions of interest. Traditional convolutional neural networks (CNNs) often face limitations when processing high-resolution images or when capturing long-range dependencies across distant regions in an image15. Additionally, in this experiment, data augmentation was applied under human supervision to ensure that the generated images remained realistic and reflective of typical pathological variability, avoiding the introduction of artificial artifacts. By simulating the diverse conditions encountered in real-world pathology practice, these techniques enhanced the model’s ability to generalize across different tissue appearances, staining quality, and slide artifacts, improving its performance on unseen biopsy samples while reducing the risk of overfitting16,17. HOTSPoT’s scalability is further demonstrated by its small size (approximately 15 MB) and its fast inference (about 1.3 s per WSI on the test set), even on large datasets18.
Additionally, its seamless integration with QuPath enables general pathologists to incorporate advanced AI tools into their workflow, facilitating research and enhancing the potential for broader adoption and contribution to scientific practice19. In particular, the model empirically demonstrated superior performance on clean, thin, artifact-free sections, as reflected by differing metrics across the various validation centers. For example, Fig. 4a (portal tract) and b (centrilobular vein) show nearly perfect segmentation, whereas the presence of artifacts (such as pen markings, dust particles, and slide imperfections) led to misclassification of non-tissue regions as tissue (Fig. 4c). The test set, which includes multiple cases of both chronic and acute hepatitis, serves as a rigorous stress test for a portal-space recognition algorithm: disease-induced alterations can be profound, and centrilobular regions may closely resemble portal spaces. The high performance metrics attest to the algorithm’s robustness, although it can occasionally “capture” small, highly inflamed lobular areas in error (Fig. 4d).

Fig. 4: Variability of the algorithm across different settings.

Segmentation performance of the model under varying conditions. Panel (a) shows accurate identification of a portal tract in a clean, artifact-free section, while (b) illustrates successful segmentation of a centrilobular vein. In contrast, (c) highlights misclassification caused by artifacts such as pen markings and debris. Panel (d) demonstrates an instance where the algorithm erroneously captures inflamed lobular areas resembling portal spaces in the context of hepatitis.

Several related approaches have been reported in the literature, using various technologies, algorithms, and datasets, and they provide useful points of comparison. A study published in the Journal of Pathology Informatics in January 2022 focused on the automated detection of portal spaces and central veins in liver tissue images of rats, achieving an F1 score of 0.81 compared to expert manual annotations. However, when applied to steatotic liver tissue, the model’s performance dropped to an F1 score of 0.59, demonstrating the difficulty of generalizing to pathological tissues not seen during training20. A study by Hanyi Yu et al. in 2022 proposed the segmentation of portal areas in human liver biopsies. The introduced model, named MUSA-UNet, is a variant of the traditional UNet architecture. The authors used a dataset of histological images from 53 transplant patients stained with H&E and Masson’s trichrome, achieving precision of 0.94, recall of 0.85, F1 score of 0.89, and accuracy of 0.89 on the validation set21. Hung-Wen Tsai et al. tackled the segmentation of periportal regions infiltrated by lymphocytes, an important aspect for diagnosing and classifying chronic hepatitis. Their framework integrates automatic segmentation based on a variant of DeepLabv3, achieving IoU values of 0.905 for portal spaces and 0.744 for bile ducts. In the second step, lymphocyte infiltration was detected by analyzing the protrusions at portal boundaries, identifying suspicious infiltration areas22. In 2024, Caner Ercan et al. introduced AI(H), an AI-based model using convolutional neural networks (CNNs) to analyze liver biopsies for autoimmune hepatitis (AIH). The training dataset consisted of 123 pre-treatment liver biopsies from confirmed AIH cases, analyzed on a proprietary platform. The AI(H) model achieved 99.4% accuracy in tissue detection, 88.0% for hepatic microanatomy, 83.9% for necroinflammation features, 81.7% for bile duct damage, and 79.2% for portal inflammation23.

Despite significant progress, these studies have certain limitations. Most did not release open-source models, which would have facilitated broader adoption within the scientific community. Additionally, many datasets were derived from a single clinical center, potentially limiting the models’ applicability and generalizability24. External validation remains a critical step in ensuring the robustness and reliability of AI models for histological analysis. While high accuracy can often be achieved on training datasets, evaluating model performance on diverse, independent data from different institutions is essential to enhance generalizability and reduce the risk of overfitting25,26. Furthermore, the development and adoption of open-source solutions in pathology is increasingly important for advancing research and automated diagnostics. Open access to software and algorithms promotes collaboration, reduces barriers associated with proprietary systems, and fosters innovation and transparency within the field27.

As a first application that can be implemented in clinical practice immediately, the high concordance metrics demonstrate that our model can both precisely quantify two-dimensional tissue area (mm²) and automatically count portal tracts, complementing traditional manual adequacy assessment with a more comprehensive appraisal of the specimen28. Furthermore, by producing a fully segmented biopsy that delineates portal tracts, interface (one hepatocyte diameter from the tract), and lobular regions, the tool lays the groundwork for future pathology workflows in which segmented images could be used to guide more precise measurements of inflammatory infiltrates, fibrosis staging, and other morphometric features, ultimately supporting more accurate and reproducible diagnoses.

Beyond adequacy assessment, HOTSPoT holds significant potential for advancing liver pathology in several key applications (Fig. 5). It can extract morphometric features and localize cells or clusters within liver tissue, enabling a multidimensional analysis that is particularly promising for inflammatory and autoimmune conditions29. Additionally, it optimizes computational resources by automating liver region segmentation, enabling targeted analysis of specific compartments, such as lobules (e.g., steatosis quantification) or portal areas (e.g., portal tract study), in line with an ecologically sustainable use of AI in pathology30,31. HOTSPoT’s segmentation capabilities support the integration of spatial-omics data, such as proteomics or transcriptomics, with histological tissue context. By providing precise spatial annotations, it facilitates mapping molecular markers to specific liver regions, offering valuable insights into disease mechanisms and responses to treatment32.

Fig. 5: Clinical and research applications of the HOTSPoT liver biopsy segmentation model.

HOTSPoT enables advanced liver pathology analysis by extracting morphometric features, localizing cell clusters, and supporting targeted compartment analysis. Its segmentation capabilities facilitate efficient resource use and integration of spatial-omics data with histological context, offering insights into disease mechanisms and treatment responses.

Foundation models, such as large pretrained Vision Transformers (ViT) or models like Swin Transformer, are increasingly being explored in medical imaging. These models are typically pretrained on massive datasets like ImageNet and then fine-tuned for specific tasks. While foundation models show great promise in areas like classification, their applicability in pathology is still somewhat limited by their focus on neoplastic conditions33,34. Pathological studies, particularly in liver diseases, require specialized models that can handle the complexities of tissue segmentation, especially in non-neoplastic contexts, where inflammatory and autoimmune conditions are not sufficiently represented in the large, pretrained datasets typically used in foundation models. This challenge calls for the development of models that can be fine-tuned for non-neoplastic liver diseases, as well as the creation of specialized datasets for these diseases.

While the annotation of portal tracts is considered a low-variability task due to their clear histological definition, the use of a single expert may inherently limit the assessment of inter-observer consistency. Although no evidence of significant variability is reported in the literature for this specific feature, future studies including multiple annotators could further confirm the robustness and generalizability of the annotations. Variations in scanner types, magnifications, and staining protocols were considered during model development; however, potential biases introduced by site-specific differences in sample preparation and digitization may still impact performance in real-world clinical settings.

This study presents a promising approach for automating the segmentation and semantic recognition of liver tissue in WSI using a transformer-based model. HOTSPoT demonstrated high performance across a variety of scanning protocols and pathological conditions, showcasing its potential for standardizing liver tissue analysis and supporting research in rare liver diseases. Furthermore, the trained model is freely available, enabling local use and fostering broader accessibility within the research community.

Methods

Cases

Hematoxylin and eosin (H&E)-stained liver biopsy WSIs were retrieved from three institutions (Fig. 6), namely IRCCS Fondazione San Gerardo dei Tintori (Monza, Italy), the Institute of Pathology, Hannover Medical School (Hannover, Germany), and ASST Grande Ospedale Metropolitano Niguarda (Milan, Italy), and were anonymized. These slides were digitized using four different scanners at different magnifications. Demographic (sex and age) and key pathological data (final diagnosis, Ishak fibrosis stage) were collected for each case. Additionally, a test set was assembled from 20 cases from five external institutions, namely Hospital Clínic de Barcelona (Barcelona, Spain), Hospital Italiano de Buenos Aires (Buenos Aires, Argentina), Hospitais da Universidade de Coimbra (Coimbra, Portugal), Maastricht University Medical Centre (Maastricht, Netherlands), and Policlinico di Palermo (Palermo, Italy), together with 5 cases from one institution internal to the training/validation set; each was scanned with an institution-device combination different from those used in the training/validation set. Technical and scanning characteristics of the whole dataset are reported in Table 3. All participating institutions are recognized as expert referral centers for liver diseases and (in part) members of the European Reference Network for rare liver diseases.

Fig. 6: Study design.

H&E-stained liver biopsy WSIs were collected from multiple expert liver pathology centers across Europe and South America, digitized using diverse scanners and magnifications. The dataset includes training/validation cases from three institutions and an independent test set from five external centers, ensuring variability and robustness across acquisition settings.

Table 3 Summary of the sample size and technical characteristics of H&E slides from the centers involved in the study

Approval was obtained from the local ethics committees of each institution. The coordinating center is the University of Milano-Bicocca, which coordinates the PBC Database (NCT05151809) and the AIH Database (NCT06078098), approved by CET Lombardia 3.

Annotations

Tissue on the biopsy was identified using QuPath’s custom tissue finder with a thresholding approach35. Portal tracts were annotated within the same software based on a standard definition (presence of at least two out of three portal triad structures, including at least one portal artery or interlobular duct)36 and semantically delineated from the lobular compartment by a hepatopathologist (GC). As the resolution of each WSI is in the order of tens of thousands of pixels and the region of interest (ROI) size varies among the different images, a tiling procedure was applied on the ROIs to obtain smaller images—of a fixed size—suitable as input for the transformer model. Tiles of 256 × 256 pixels (no overlap) were extracted from tissue areas at a magnification of 1 μm/px (approximately 10× magnification). These tiles (original images) were extracted along with their corresponding masks, in which distinct colors were used to label background, liver lobule tissue, and portal areas. The same extraction procedure was also performed at a magnification of 2 μm/px. The percentage of pixels belonging to each of the three classes (see below) was computed in the training/validation and test sets to provide an estimate of class distribution across the dataset.
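The tiling and class-distribution steps described above can be sketched as follows. This is an illustrative sketch, not the extraction script used in the study: the integer label encoding (0 = background, 1 = lobule, 2 = portal) is an assumption, since the original masks use distinct colors, and discarding partial tiles at the ROI border is also an assumption. The function names are hypothetical.

```python
from collections import Counter

TILE = 256  # tile edge in pixels, as stated in the text


def tile_origins(width, height, tile=TILE):
    """Top-left corners of non-overlapping tiles that fit entirely inside the ROI."""
    # Stepping by `tile` gives zero overlap; partial border tiles are dropped
    # (an assumption, the text does not say how borders were handled).
    return [
        (x, y)
        for y in range(0, height - tile + 1, tile)
        for x in range(0, width - tile + 1, tile)
    ]


def class_percentages(mask, class_names=("background", "lobule", "portal")):
    """Percentage of pixels per class in a 2D mask of integer labels."""
    counts = Counter(label for row in mask for label in row)
    total = sum(counts.values())
    return {
        name: 100.0 * counts.get(index, 0) / total
        for index, name in enumerate(class_names)
    }
```

Applying `class_percentages` to the concatenation of all masks in a split yields dataset-level proportions of the kind reported for the training/validation and test sets.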

Model development

This experiment focused on training a deep learning model for multi-class semantic segmentation using the SegFormer architecture, fine-tuned for segmenting images into three distinct classes: background, liver lobule, and portal areas. The model utilized a pretrained SegFormer-b0 structure (nvidia/segformer-b2-finetuned-ade-512-512), which was adapted specifically for this task using images and their corresponding masks37.

Data augmentation, including rotations, flips, elastic transformations, grid distortions, and color jittering, was applied to images and masks, with strength tailored by a pathologist to ensure realism and reflect laboratory variability without introducing artifacts. Early stopping based on validation loss prevented overfitting, and performance was evaluated at each epoch using intersection over union (IoU) and the Dice coefficient, both overall and per class. The Dice coefficient measures the similarity between the predicted and ground truth segmentations at patch level as twice their overlap divided by the sum of their sizes, while IoU is the ratio between the overlapping area (intersection) and the total combined area (union) of the prediction and the ground truth. Inference at the WSI level was carried out using a modified version of the WSInfer library, specifically adapted for deploying WSI-level models in TorchScript format. This library is fully customizable to accommodate any segmentation workflow. No overlap between tiles was used during tile-based inference. Model predictions are output as GeoJSON files, an open standard for representing geographic features, which can then be imported into QuPath. Training and inference were performed on a machine equipped with 16 GB of RAM and an NVIDIA GeForce GTX 1070 GPU with Max-Q Design.
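For a given class, with P the set of predicted pixels and G the set of ground truth pixels, the two metrics are Dice = 2|P ∩ G| / (|P| + |G|) and IoU = |P ∩ G| / |P ∪ G|. The sketch below computes both per class for a single patch. It is a plain-Python illustration, not the evaluation code used in the study; the integer label encoding and the choice to skip classes absent from both masks are assumptions.

```python
def dice_and_iou(pred, target, num_classes=3):
    """Per-class Dice and IoU for two 2D label masks of identical shape.

    Labels are assumed to be integers in range(num_classes). Classes absent
    from both masks are omitted from the result rather than scored.
    """
    pred_flat = [p for row in pred for p in row]
    target_flat = [t for row in target for t in row]
    scores = {}
    for c in range(num_classes):
        pred_c = sum(1 for p in pred_flat if p == c)        # |P|
        target_c = sum(1 for t in target_flat if t == c)    # |G|
        inter = sum(
            1 for p, t in zip(pred_flat, target_flat) if p == c and t == c
        )                                                   # |P ∩ G|
        union = pred_c + target_c - inter                   # |P ∪ G|
        if union == 0:
            continue
        scores[c] = {
            "dice": 2.0 * inter / (pred_c + target_c),
            "iou": inter / union,
        }
    return scores
```

For a single mask pair the two metrics are linked by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU; averages taken over many patches need not satisfy this identity exactly.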

HOTSPoT clinical deployment: biopsy adequacy assessment

The resulting model (HOTSPoT) was applied to evaluate the adequacy of 35 consecutive cases, including 25 internal cases from the IRCCS Fondazione San Gerardo dei Tintori (Monza) and 10 consultation cases referred to the same unit, without restriction by underlying disease. WSI-level inference was performed on each case to identify portal regions, which were automatically labeled and counted, providing a continuous estimate of the total number of portal tracts where feasible, specifically in cases with Ludwig stage <3. For all 35 cases, the total biopsy area on the slide (in mm²) was extracted, as well as the cumulative portal tract area and its proportion relative to the entire biopsy. The portal tract count was also extracted from the original pathology report and independently assessed by a second observer (AQ), allowing comparison of both interobserver agreement and model performance. Biopsy length (in cm) was manually measured for each case. Agreement between the model and human observers in counting portal tracts was assessed using weighted Cohen’s kappa. To evaluate the reliability of HOTSPoT as a surrogate for immediate macroscopic assessment of histologic fibrosis, Pearson correlation coefficients were calculated between HOTSPoT’s biopsy area quantification and manual length measurements, as well as between HOTSPoT’s portal area percentage and histological fibrosis scores.
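The two agreement statistics named above can be sketched in plain Python. This is an illustration under stated assumptions, not the analysis code used in the study: the weighting scheme is not specified in the text, so linear weights are used by default, with categories ordered by value and weighted by rank distance. The function names are hypothetical.

```python
import math
from collections import Counter


def weighted_kappa(rater_a, rater_b, weights="linear"):
    """Weighted Cohen's kappa for two equal-length lists of ordinal ratings."""
    categories = sorted(set(rater_a) | set(rater_b))
    k = len(categories)
    if k < 2:
        return 1.0  # both raters used a single identical category throughout
    index = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)
    observed = Counter((index[a], index[b]) for a, b in zip(rater_a, rater_b))
    marg_a = Counter(index[a] for a in rater_a)
    marg_b = Counter(index[b] for b in rater_b)
    power = 1 if weights == "linear" else 2  # any other value means quadratic
    disagreement_obs = 0.0
    disagreement_exp = 0.0
    for i in range(k):
        for j in range(k):
            w = (abs(i - j) / (k - 1)) ** power  # disagreement weight, 0 on diagonal
            disagreement_obs += w * observed.get((i, j), 0) / n
            disagreement_exp += w * marg_a[i] * marg_b[j] / (n * n)
    return 1.0 - disagreement_obs / disagreement_exp


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length numeric sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

Kappa equals 1 for perfect agreement and 0 when agreement matches what the marginal distributions would produce by chance; weighting penalizes a count that is off by one less than a count that is off by several.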