Fig. 2: HistoGPT simultaneously learns from vision and language to generate histology reports from whole slide images.

From: Generating dermatopathology reports from gigapixel whole slide images with HistoGPT

a HistoGPT is available in three sizes (Small, Medium, and Large). It consists of a patch encoder (CTransPath for HistoGPT-S/HistoGPT-M, UNI for HistoGPT-L), a position encoder (used only in HistoGPT-L), a slide encoder (the Perceiver Resampler), a language model (BioGPT base for HistoGPT-S, BioGPT large for HistoGPT-M/HistoGPT-L), and tanh-gated cross-attention blocks (XATTN). HistoGPT takes a series of whole slide images (WSIs) at 10×–20× magnification as input and outputs a written report. Optionally, users can query the model for additional details with prompts such as “The tumor thickness is”, and the model completes the sentence, e.g., “The tumor thickness is 1.2 mm”.

b We train HistoGPT in two phases. In the first phase, the vision module is pre-trained using multiple instance learning (MIL). In the second phase, we freeze the pre-trained layers and fine-tune the language module on image-text pairs. To prevent the model from overfitting to the same sentences, we apply text augmentation with GPT-4 to paraphrase the original reports.

c During deployment, we use an inference method called Ensemble Refinement (ER): the model stochastically generates multiple candidate reports using a combination of temperature, top-p, and top-k sampling to capture different aspects of the input image, and an aggregation module (GPT-4) then combines the results into a more complete description of the underlying case. Source data are provided as a Source Data file.
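The tanh-gated cross-attention blocks named in panel a follow the widely used Flamingo-style design, in which zero-initialized gates let visual information flow into a pre-trained language model gradually. The sketch below is a minimal single-block PyTorch illustration under that assumption; the dimensions, head count, and layer layout are illustrative, not HistoGPT's exact implementation.

```python
# Minimal sketch of a tanh-gated cross-attention (XATTN) block: language-model
# hidden states attend to slide-encoder features. Dimensions and layer layout
# are illustrative assumptions, not the exact HistoGPT implementation.
import torch
import torch.nn as nn

class TanhGatedCrossAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        # Gates start at 0, so tanh(gate) = 0: the block is initially a no-op
        # and the pre-trained language model is undisturbed when fine-tuning
        # begins.
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ffn_gate = nn.Parameter(torch.zeros(1))
        self.norm_attn = nn.LayerNorm(d_model)
        self.norm_ffn = nn.LayerNorm(d_model)

    def forward(self, text: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        # text:  (batch, n_tokens, d_model) language-model hidden states
        # image: (batch, n_latents, d_model) slide-encoder outputs
        attn_out, _ = self.attn(self.norm_attn(text), image, image)
        text = text + torch.tanh(self.attn_gate) * attn_out
        text = text + torch.tanh(self.ffn_gate) * self.ffn(self.norm_ffn(text))
        return text

block = TanhGatedCrossAttention(d_model=768)
out = block(torch.randn(1, 16, 768), torch.randn(1, 64, 768))
```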
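The two-phase schedule of panel b can be summarized in PyTorch-style pseudocode. The mean-pooled MIL head, the freezing granularity, and the `visual=` keyword on the language model are a hypothetical interface chosen for illustration; only the overall schedule (MIL pre-training, then frozen vision plus language fine-tuning on paraphrased reports) comes from the caption.

```python
# Hypothetical sketch of the two training phases; module names and call
# signatures are placeholders, not HistoGPT's actual API.
import torch
import torch.nn.functional as F

def phase1_mil_step(slide_encoder, mil_head, opt, patch_feats, label):
    """Phase 1: MIL pre-training of the vision module on slide-level labels."""
    # patch_feats: (1, n_patches, d) features from the patch encoder
    # (CTransPath or UNI); the slide encoder (Perceiver Resampler) compresses
    # them to a fixed number of latent tokens.
    latents = slide_encoder(patch_feats)        # (1, n_latents, d)
    logits = mil_head(latents.mean(dim=1))      # pool latents, classify slide
    loss = F.cross_entropy(logits, label)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss

def phase2_lm_step(slide_encoder, language_model, opt, patch_feats, report_ids):
    """Phase 2: freeze the pre-trained vision module, fine-tune the text side."""
    for p in slide_encoder.parameters():
        p.requires_grad = False
    with torch.no_grad():
        latents = slide_encoder(patch_feats)
    # Next-token prediction on the (GPT-4-paraphrased) report, conditioned on
    # the slide latents via the gated cross-attention blocks. The `visual=`
    # keyword and `.loss` attribute are an assumed interface.
    loss = language_model(report_ids, visual=latents, labels=report_ids).loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss
```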
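On the sampling side, the Ensemble Refinement procedure of panel c reduces to repeated stochastic decoding followed by an aggregation step. The sketch below uses the Hugging Face transformers API with the public microsoft/biogpt checkpoint as a text-only stand-in: HistoGPT additionally conditions generation on slide features through its cross-attention blocks, which this sketch omits, and the sampling hyperparameters are illustrative rather than the paper's settings.

```python
# Text-only sketch of Ensemble Refinement: sample several candidate reports
# with temperature, top-p, and top-k sampling, then hand them to an
# aggregation model. Hyperparameter values are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")
model = AutoModelForCausalLM.from_pretrained("microsoft/biogpt")

prompt = tokenizer("Final diagnosis:", return_tensors="pt")
drafts = []
for _ in range(5):  # stochastic decoding yields diverse candidate reports
    out = model.generate(
        **prompt,
        do_sample=True,
        temperature=0.8,   # flatten/sharpen the next-token distribution
        top_p=0.95,        # nucleus sampling
        top_k=50,          # truncate to the 50 most likely tokens
        max_new_tokens=128,
    )
    drafts.append(tokenizer.decode(out[0], skip_special_tokens=True))

# The aggregation module (GPT-4 in the paper) then merges the drafts into one
# report, e.g., with a prompt along these lines:
merge_prompt = (
    "Combine the following candidate pathology reports into a single, "
    "complete report:\n\n" + "\n---\n".join(drafts)
)
```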
