Background & Summary

Brain tumours account for a large fraction of years of potential life lost as compared with tumours from other sites1, and have a significant negative impact on patients’ quality of life2. Overall, they are relatively uncommon neoplasms with an incidence of approximately 24 per 100.000 person-years3. Current diagnostic guidelines published by the WHO define approximately 150 distinct brain tumour types and assign grades I to IV, based on malignancy and potential to malignant transformation or progression. They are mainly differentiated by their histopathological phenotypes and molecular alterations4. While the majority of tumours is diagnosed solely based on histopathology, an integrated approach is mandatory for 19 tumour types.

Still, more accurate diagnostic distinctions are needed in order to i) better assess individual patients’ prognoses and ii) support more robust therapeutic decisions4,5. Recently, diagnostic algorithms trained on DNA methylation data have been shown to significantly increase diagnostic accuracy6. Similar advances focusing on histopathological data have been hampered, so far, by the lack of freely available histopathology datasets7. Most available histopathology data such as those available through TCGA8, IvyGAP9,10 or TCIA11 focus on only a few diagnostic entities. They mostly consist of digitized fresh frozen tissue sections, which feature relatively poor tissue morphology as compared to formalin-fixed and paraffin-embedded tissues. Still, even with these limited data, computational algorithms have been successfully trained - amongst others - for survival prediction12, detection of tumour-infiltrating lymphocytes13, and assessments of tumour microvessels14. However, larger datasets encompassing an even wider range of brain tumours and featuring improved cellular and morphological characteristics are necessary to further develop these algorithms and extend their applicability to the entire spectrum of brain tumour types.

Thus, we set out to compile a comprehensive resource of digitized Haematoxylin-eosin(H&E)-stained brain tumour whole slide images (WSIs) with clinical annotations (Fig. 1). We aimed to capture the complete spectrum of brain tumours as encountered in day-to-day medical diagnostic practice. Importantly, we managed to specifically digitize slides of exceedingly rare pathologies, which are usually, if ever, seen only a few times in a pathologist’s lifetime. By performing a manual review of each slide, we ensure high scan quality and actuality of provided diagnoses. We envisage this dataset to be used for advancing digital pathology-based machine learning and for teaching purposes. Importantly, this dataset can be used for (1) inter-tumour comparisons thanks to the wide inclusion of distinct brain tumour types as well as (2) within-tumour-type investigations thanks to the inclusion of a large number of samples for the common tumour types.

Fig. 1
figure 1

Overview of the data acquisition and publication process. First, histological slides and clinical records of brain tumour patients were retrieved from the biobank of the Division of Neuropathology and Neurochemistry, Medical University of Vienna. Then, slides were digitized using a Hamamatsu slidescanner. Clinical data were translated into standardized annotations. At least two experienced neuropathologists checked each slide scan to ensure conformity of the diagnosis with the current revised 4th edition of the “WHO Classification of Tumours of the Central Nervous System” and sufficient scan quality. Ambiguous cases were excluded and WSIs of inferior quality were re-scanned. Finally, data were made available via EBRAINS to the international research community. (Brain illustration adapted from Meaghan Hendricks from the Noun Project).

Methods

Sample acquisition

H&E stained tumour slides from FFPE tissues, which were collected for routine diagnostics in the time interval of 1995–2019 have been obtained from the biobank of the Division of Neuropathology & Neurochemistry, Medical University of Vienna. We digitized each slide in high magnification (40x objective, 228 nm/pixel) using a Hamamatsu NanoZoomer 2.0 HT slide-scanner. Each slide was manually reviewed to ensure high scan quality and sufficient diagnostic tumour tissue. Samples with equivocal diagnoses or missing molecular work-up otherwise needed to assign an integrated WHO 2016 diagnosis were excluded. A subset of glioblastoma scans (n = 381) has been published previously as part of the GBMatch study15.

Basic clinical annotations consisting of patient age and sex as well as tumour location and recurrence were acquired from local electronic records where available. Tumour locations have been assigned to the following 19 categories: frontal; parietal; insular; occipital; temporal; cerebellar; brain stem; spinal; lateral ventricle; diencephalon; third ventricle; fourth ventricle; sellar region; cranial nerves; basal ganglia; cerebral, NOS (not otherwise specified); posterior fossa, NOS; cranial, NOS; and other.

This study complies with the relevant ethical, legal and institutional regulations and the study protocol has been approved by the Ethics Committee of the Medical University of Vienna (EK1691–2017). Participant informed consent has been obtained as by institutional guidelines, necessitating restrictions on commercial use of the obtained data.

Estimation of cell density and scanned tissue area

Additionally, the total tissue area and the average cellularities were estimated for each scan using a custom MATLAB script (MATLAB R2017b, MathWorks) with a similar approach as previously published15,16. In summary, H&E stained WSIs were first colour-deconvoluted into separate Haematoxylin and Eosin channels17. Then, global, Phansalkar and Otsu thresholding were applied to the Haematoxylin channel to identify nuclei18,19. Watershedding was used to separate densely clustered cells20. Only cells with a minimum size of 4 pixels were kept. The total tissue area was determined by averaging all colour channels, thresholding at a threshold of 220, followed by binary close and open operations.

Data Records

Data are provided via EBRAINS21 as one ndpi-file per sample, sorted by diagnostic tumour type (in alphabetical order) for easier access. It is possible to download single files directly or all files of a specific tumour type or the whole dataset using a download manager (such as the Chrono Download Manager for the Google Chrome browser). Furthermore, supplementary clinical information, estimated cell densities and scanned tissue area is provided in a csv-spreadsheet with one row per tumour sample. An overview of all spreadsheet variables and descriptions is given in Table 1.

Table 1 Recorded clinical variables and corresponding descriptions.

A total of 3,115 histological slides of 2,880 patients have been scanned. A total of 126 distinct diagnostic tumour types could be included. There are 1,395 female and 1,462 male patients in the dataset. The mean patient age at brain tumour surgery was 45 years, ranging from 9 days to 92 years. 2,530 of the scanned slides originated from primary operations and 538 from re-operations. See online-only Table 1 for descriptive properties broken down by tumour type. Descriptive visualizations of patient age, sex, tumour location, cellularity, and scanned tissue area are given in Fig. 2. Of note, we also scanned exceptionally rare tumour types such as melanotic schwannomas or liponeurocytomas (Fig. 3). A total of 47 non-tumour slides from different non-tumour CNS regions and with different pathologies were included as controls.

Fig. 2
figure 2

Descriptive statistics of the ‘Digital Brain Tumour Atlas’ patient cohort (not including control patients). (a) The age distribution by sex shows a bimodal distribution with most patients belonging to the higher-age categories. Since some uncommon tumour types like medulloblastoma occur mainly in children and have been strategically over-sampled, there is also a peak in younger patients. (b) The distribution of the different WHO grades shows a slight predominance of grade I and grade IV tumours. Of note, some tumour entities are not assigned WHO grades (‘NA’) and very few tumour types are assigned intermediate grades II-III (a total of five cases, not shown in the figure). (c) Tumour distribution with colour-coded locations and ratio-specific circle sizes. (Brain illustration adapted from Patrick J. Lynch, wikimedia) (d) Distribution of the cell densities of all included tumour samples by tumour grade. Note that lower-grade tumours are not necessarily less cell dense (e.g., in the case of cellular schwannoma). (e) The distribution of the scanned tissue areas (per slide).

Fig. 3
figure 3

Exemplary images from exceedingly rare brain tumours, which are included in the DBTA. (a) Perineurioma component of a hybrid nerve sheath tumour. (b) Angiosarcoma. (c) Lymphoplasmacyte-rich meningioma. (d) Crystal-storing histiocytosis. (e) Embryonal tumour with multilayered rosettes. (f) Melanotic schwannoma. (g) Angiocentric glioma. (h) Cerebellar liponeurocytoma. (i) Pituicytoma.

Technical Validation

All cases were initially selected based on the given diagnosis in the diagnostic electronic records. To ensure conformity with the WHO 2016 diagnosis, all slides have been independently reviewed by two neuropathologists experienced in neuro-oncology. In disputed cases, a third senior neuropathologist was consulted. Older cases with missing necessary molecular analyses were not included in the dataset.

Inter- and intraobserver variability is one factor that contributes to misdiagnoses or discrepant diagnoses. We mitigated the risk by including only cases that had already undergone thorough routine diagnostic work-up and were additionally reviewed independently by at least two neuropathologists as described above. In this way, we also ensured excellent image quality and the presence of sufficient diagnostic tumour tissue on each WSI. Scans with suboptimal image quality were either re-scanned (if possible) or excluded.

Usage Notes

Data access

The data can be accessed via EBRAINS21. In order to download the data set, users have to register with EBRAINS and agree to the general terms of use, access policy as well as the data use agreement for pseudonymised human data (https://ebrains.eu/terms). The data are distributed under the conditions that users cite the respective DOI, adhere to EBRAINS’ Data Use Agreement and do not use the data for commercial purposes.

WSI processing

The ndp.view2 (© Hamamatsu) software can be freely used to view and annotate slide scans saved in the ndpi format22. Alternatively, most other WSI programs such as the open-source OMERO software platform23 and the open-source QuPath software24 can work directly on ndpi-files. However, most programming languages and non-specialized image processing software cannot handle ndpi-files out of the box. Thus, we also provide a toolbox of MATLAB scripts that depend on the openslide library25 and can be used to

  1. 1.

    Automatically tile large slide scans and export multiple smaller image patches in a given magnification.

  2. 2.

    Convert annotation-files (.ndpa) to overlays, which can be used to extract specific regions of interest.

  3. 3.

    Estimate the total tissue area on a WSI.

  4. 4.

    Estimate the cell density on a WSI.

Of note, slide thickness and staining intensity vary to some degree, resulting in a slightly different histological appearance of each slide. Thus, for machine learning applications, we recommend astain normalization step such as WSICS26, more recent methods employing generative adversarial networks27 or style transfer learning28. Moreover, heavy stain colour augmentation should be performed29. Of note, the stain normalization step can be omitted with only a negligible drop in performance as has been shown by Tellez et al.29.