PW-BALFC, a clinical dataset for detection and instance segmentation of bronchoalveolar lavage fluid cell

Shi, Xin; Huang, Qing; Xu, Teng; Mei, Hongwen; Quan, Tingwei; Wang, Xiuli; Shi, Yinghan; Hu, Ye; Duan, Zhimei; Xie, Fei; Li, Sifan; Xie, Lixin; Wang, Kaifei

doi:10.1038/s41597-025-05452-4


Data Descriptor
Open access
Published: 01 July 2025


Scientific Data volume 12, Article number: 1074 (2025)


Abstract

Bronchoalveolar lavage fluid (BALF) cytology provides an important basis for the diagnosis and treatment of lung diseases. Current cytological analysis of BALF relies on manual microscopic examination, which is time-consuming, laborious, and experience-dependent. Automated identification of BALF cells can increase the accuracy and speed of screening qualified samples and of subsequent cytomorphology analysis. However, there is a lack of public clinical BALF cell datasets for the detection of different cell types and a lack of pixel-level annotations for cytomorphology analysis. In this work, high-resolution cell images were collected from clinical bronchoalveolar lavage samples obtained at the Chinese PLA General Hospital from 2018 to 2024, and high-quality pixel-level instance annotations of seven cell types were produced. In total, the dataset comprises 2,105 clinical images containing 13,263 cells from seven distinct classes, annotated with both fine contours and bounding boxes. A YOLOv8 instance segmentation network was trained and tested on the dataset. The results demonstrate that the dataset and model we provide are beneficial for the study of automated cell identification in BALF.


Background & Summary

Bronchoalveolar lavage fluid (BALF) provides a direct window into lung diseases. BALF analysis plays a pivotal role in the diagnosis and evaluation of interstitial lung disease (ILD), pneumonia, and other respiratory disorders and is globally recognized as a valuable clinical diagnostic tool^1,2. Cytomorphologic analysis and classification of BALF cells are crucial for diagnosing lower respiratory tract diseases, monitoring therapeutic responses and predicting patient prognosis. Generally, an increase in specific cell types in the BALF is correlated with particular types of ILD. For example, the lymphocyte proportion typically exceeds 15% in patients with sarcoidosis and hypersensitivity pneumonitis, and a neutrophil proportion in excess of 3% may indicate infection, aspiration pneumonia, or acute respiratory distress syndrome². Changes in the overall count and proportions of particular types of cells in BALF can be used to predict the risk of clinical lung transplantation rejection³. More importantly, BALF and tracheal aspirates examined via Microbiological Rapid On-Site Evaluation (M-ROSE)⁴ can aid in the rapid diagnosis of lung infection, confirmation of infection patterns, and timely administration of antimicrobial therapy to improve clinical prognosis⁵, which is critical for ill patients, especially those with severe pneumonia.

There is an urgent need for automated morphological analysis of BALF cells in clinical practice. Current morphologic analyses of BALF cells rely on the manual identification of different types of cells under a microscope, which is labor intensive and time-consuming. Clinical operators must evaluate numerous slides to determine whether they meet quality standards, on the basis of the proportions of erythrocytes and epithelial cells. Some guidelines and expert consensuses have been published^3,6. However, the skill of clinical operators varies greatly across hospitals and diagnostic institutions, and differences in sample processing and acquisition techniques affect the quality of the slides, including discrepancies in patient positioning, cough status, bronchoscope thickness, suction power, and smear evenness^6,7. In a qualified BALF sample, the proportion of red blood cells is less than 10% in the absence of bleeding, and the proportion of ciliated columnar epithelial cells or squamous epithelial cells should be less than 5%, which indicates that the alveolar sample has not been contaminated with bronchial cells^3,6. A previous study⁸ reported that approximately 50% of the samples obtained were unqualified. Another study⁹ had to discard some samples from private and academic teaching hospitals because of substandard BALF quality. In addition, clinical operators must manually calculate the number and ratio of four types of cells (macrophages, lymphocytes, neutrophils, and eosinophils), a task made harder because the background, color, shape, size, and number of nuclei vary greatly among patients.
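The quality criteria quoted above can be expressed as a simple check on per-class cell counts. The following is a minimal sketch, not the procedure of any cited guideline: the function name, the class keys (RBC for red blood cells, CEC and SEC for ciliated columnar and squamous epithelial cells), the use of the total cell count as the denominator, and the choice to apply the 5% limit to the combined epithelial fraction are all assumptions made for illustration.

```python
from typing import Dict

# Thresholds quoted in the text. Applying the 5% limit to the combined
# epithelial fraction is an assumption of this sketch.
MAX_RBC_FRACTION = 0.10
MAX_EPITHELIAL_FRACTION = 0.05


def is_qualified_sample(counts: Dict[str, int]) -> bool:
    """Return True if the per-class cell counts satisfy both quality criteria."""
    total = sum(counts.values())
    if total == 0:
        return False  # an empty slide cannot be evaluated
    rbc_fraction = counts.get("RBC", 0) / total
    epithelial_fraction = (counts.get("CEC", 0) + counts.get("SEC", 0)) / total
    return (rbc_fraction < MAX_RBC_FRACTION
            and epithelial_fraction < MAX_EPITHELIAL_FRACTION)


# Hypothetical counts for two slides (not taken from any dataset).
good = {"RBC": 5, "CEC": 1, "SEC": 1, "MC": 60, "LC": 10, "NC": 20, "EC": 3}
bloody = {"RBC": 40, "CEC": 1, "SEC": 1, "MC": 40, "LC": 8, "NC": 10, "EC": 0}
print(is_qualified_sample(good), is_qualified_sample(bloody))
```

In an automated pipeline, the counts would come from a detector's per-class predictions rather than from manual tallies.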

The development of artificial intelligence (AI), especially deep learning, provides new possibilities for the automatic detection and morphological analysis of BALF cells. The robustness and accuracy of these data-driven AI algorithms depend on the quantity and quality of the dataset and its annotations. Currently, there is a lack of clinically available public datasets of BALF cells with detailed, comprehensive labeling of cellular subtypes, which limits the advancement of AI in BALF cell detection and morphological analysis and hinders the development of new methods. Only recently have a few researchers published work on BALF cell detection via AI methods, and all of it relies on private or inaccessible datasets. Tao et al. developed an automated BALF image scanning system and used MaskCNN to detect the seven types of BALF cells (erythrocytes, ciliated columnar epithelial cells, squamous epithelial cells, macrophages, lymphocytes, neutrophils, and eosinophils) in a private dataset with bounding box annotation¹⁰. Wu et al.¹¹ and Rumpf et al.⁹ applied improved YOLOv5s and YOLOv7 algorithms to detect macrophages, lymphocytes, neutrophils, and eosinophils in a competition dataset (“Bronchoalveolar Lavage Fluid Cell Sorting Count Challenge”, which is not currently available) and a private dataset with bounding box annotation, respectively. van Huizen et al.¹² employed higher harmonic generation microscopy and deep learning techniques to differentiate and quantify leukocytes (neutrophils, eosinophils, lymphocytes, and macrophages) in private BALF samples.

In this paper, we present a large, high-resolution, public clinical BALF cell dataset with pixel-level instance annotations, named PW-BALFC (which stands for the PLA-WIT’s BALF cell dataset). To the best of our knowledge, PW-BALFC is the first publicly available dataset dedicated specifically to BALF cell detection and analysis. PW-BALFC differs from other BALF cell datasets in two respects: (1) Most other BALF cell datasets focus predominantly on the detection of macrophages, lymphocytes, neutrophils, and eosinophils and do not cover erythrocytes and epithelial cells. PW-BALFC provides seven clinical types of BALF cells, so it can be used both for the automatic screening of qualified samples and for counting the different cell types. (2) Other BALF datasets provide only bounding box annotations, but boxes fit poorly to irregular cell morphologies, such as ciliated columnar epithelium, and to densely overlapping cells. PW-BALFC provides comprehensive annotations, including bounding box annotations and high-quality pixel-level annotations of every cell, which can be applied for precise localization and further morphological analysis of cells. The PW-BALFC dataset contains 2,105 clinical images with 13,263 annotated cells from seven typical cell classes, all with pixel-level annotations. It can be used for instance detection and segmentation of BALF cells and can facilitate BALF cell morphology research and pedagogical studies.

Methods

Ethical statement

The study was approved by the Chinese PLA General Hospital (Approval Number: 20220322001). All samples were collected from participants who underwent bedside bronchoscopy during their hospitalization in the Respiratory and Critical Care Unit (RICU) of the Chinese PLA General Hospital. All the participants were evaluated for indications and exclusion of contraindications before undergoing bronchoalveolar lavage with M-ROSE. Participants were adequately informed and provided both verbal and written informed consent. All the images in the dataset were anonymized to protect privacy. This study met the criteria set out in the Declaration of Helsinki.

Data collection

The study samples were derived from patients who underwent bronchoalveolar lavage and endotracheal aspiration between 2018 and 2024. A total of 558 patients were recruited for this study. Among them, 337 were male, and 181 were female. The age of the participants ranged from 14 to 100 years, with a median age of 71 years. The 25th percentile for age was 57 years, and the 75th percentile was 85 years. A total of 485 patients were diagnosed with pulmonary infectious diseases, and another 73 patients were diagnosed with noninfectious pulmonary diseases.

Among patients diagnosed with pulmonary infectious diseases, most (82.3%) had bacterial infections, and others had opportunistic pulmonary infections (7.0%), viral infections (4.7%), fungal infections (3.1%), and atypical pathogen infections (2.9%). A total of 143 patients were diagnosed with severe pulmonary infections as defined by relevant clinical criteria. Among patients diagnosed with noninfectious pulmonary diseases, the most common diagnosis was interstitial lung disease (25 patients); others had pulmonary neoplasms, bronchiectasis, pulmonary hemorrhagic disorders, pulmonary edema, and other conditions.

Sample preparation included fluid extraction, centrifugation, smearing, and staining. After centrifugation of the alveolar lavage fluid, approximately 0.3 ml of the sample was retained and mixed, and approximately 20 μl was used for a 1-cm² smear on a slide. After the samples were dried, Diff-Quick or Gram staining was performed. For Diff-Quick staining¹³, the slides were sequentially submerged in liquid A (10–30 s), rinsed in phosphate buffer solution (PBS), immersed in liquid B (20–40 s), cleaned in water, and dried for microscopic examination. After Diff-Quick staining, the cell nucleus exhibited a purplish-red hue, and the cytoplasm displayed a bluish-purple color⁷. For Gram staining, the slides were initially stained with crystal violet, then stained with iodine solution, decolorized with alcohol, and counterstained with a safranin staining solution. After Gram staining, the cells tended to be pink. Different types of BALF cells exhibit various morphological characteristics based on the different staining of the nucleus and cytoplasm.

After the sample slides were prepared, images were collected via an ImageView optical microscope (model: OLYMPUS CX31). The clinical operator first observed the whole smear at low magnification and then selected an area of interest that was clearly stained and had a uniform cell distribution. The area of interest was then examined at high magnification with a 100 × objective lens (numerical aperture (NA) = 0.9, working distance of 0.37 mm), for an overall magnification of 100 ×. All the images in the dataset were acquired via an OLY Cam DP1000 camera with pixel dimensions of 4912 × 3684 and a pixel size of 1.85 µm × 1.85 µm. A patient may undergo several bedside bronchoscopy procedures for M-ROSE examination during the disease course, and each procedure may involve several samples; therefore, multiple images may originate from the same patient or sample.

Data annotation

A total of 6,757 images were collected from the Chinese PLA General Hospital; after poor-quality and cell-free images were manually excluded, 1,940 images remained for labeling. We used the Label Studio software platform (available at https://labelstud.io/) for annotation. The images were first resampled to a size of 853 × 640 to allow fast loading and smooth labeling. Then, we determined the class of each cell and meticulously outlined its contour as a mask, using a different color for each class. In this way, each pixel is assigned to one of 8 classes (7 classes of cells and 1 background class). The final segmentation results were saved in the YOLO instance detection and segmentation format. As this format stores contour coordinates as ratios of the image width and height, the labels can be applied directly to the high-resolution images. This fine contour labeling method enables the precise identification of various cell sizes and morphologies. To guarantee the annotation quality, a two-stage collaborative annotation process was implemented. First, two experienced annotators annotated the images via cross-labeling. Two senior clinical cytologists subsequently examined and corrected each annotation of the labeled images. This two-stage process was carried out twice, until the annotation results were undisputed. The final number of validly labeled cells was 12,267. Figure 1 presents the typical morphology of the seven types of BALF cells and their corresponding annotation results.
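The reason labels drawn on the resampled images carry over to the original images can be shown with a short sketch. It assumes contour vertices stored as fractions of image width and height; the helper name `to_pixels` and the example contour are hypothetical, while the two image sizes are those given in the text.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def to_pixels(norm_polygon: List[Point], width: int, height: int) -> List[Point]:
    """Scale normalized (x, y) vertices in [0, 1] to pixel coordinates."""
    return [(x * width, y * height) for x, y in norm_polygon]


# Hypothetical normalized contour (a small rectangle), not taken from the dataset.
contour = [(0.25, 0.25), (0.5, 0.25), (0.5, 0.5), (0.25, 0.5)]

resampled = to_pixels(contour, 853, 640)    # labeling resolution
original = to_pixels(contour, 4912, 3684)   # acquisition resolution
print(resampled[0], original[0])
```

The same normalized contour yields pixel coordinates at either resolution, so no relabeling is needed when switching between the resampled and the high-resolution images.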

Data Records

The PW-BALFC dataset¹⁴ is publicly available on Zenodo at https://doi.org/10.5281/zenodo.14871206. To provide high-resolution and clear visualizations of biomedical images, this dataset¹⁴ consists of four types of files: high-resolution original images, resampled images that can be used directly for network training (users can also resample on demand), corresponding visualization images that show the manually labeled contour of each cell (as shown in Fig. 1, different cell types are represented in different colors), and labels. The labels are saved in the standardized .txt format of YOLO instance segmentation. The first value of each row represents the type of the marked cell. The subsequent values are the polygon coordinates of the outer cell contour, normalized by the image width and height. This annotation format allows direct use of YOLO series models for detection and instance segmentation tasks and can also be easily converted to other data formats.
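A label row in this format can be read with a few lines of Python. The sketch below assumes whitespace-separated values and an id-to-class order that follows the order in which the classes are listed in this article; the actual mapping should be checked against the dataset's configuration file. The function names and the example row are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]

# Assumed id-to-class order; verify against the dataset's own configuration.
CLASS_NAMES = ["CEC", "RBC", "SEC", "EC", "NC", "LC", "MC"]


def parse_label_line(line: str) -> Tuple[int, List[Point]]:
    """Split one label row into (class_id, [(x, y), ...]) with normalized coordinates."""
    fields = line.split()
    class_id = int(fields[0])
    values = [float(v) for v in fields[1:]]
    if len(values) < 6 or len(values) % 2 != 0:
        raise ValueError("expected at least three (x, y) pairs")
    return class_id, list(zip(values[0::2], values[1::2]))


def polygon_to_bbox(polygon: List[Point]) -> Tuple[float, float, float, float]:
    """Return the enclosing box as normalized (x_center, y_center, width, height)."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    x_min, x_max, y_min, y_max = min(xs), max(xs), min(ys), max(ys)
    return ((x_min + x_max) / 2, (y_min + y_max) / 2, x_max - x_min, y_max - y_min)


# Hypothetical row, not copied from the dataset.
class_id, polygon = parse_label_line("4 0.25 0.25 0.75 0.25 0.75 0.5 0.25 0.5")
print(CLASS_NAMES[class_id], polygon_to_bbox(polygon))
```

Deriving the box from the polygon in this way is one route to a detection-only label when only the contour is stored.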

Table 1 displays the basic statistics of our data. Each pixel is assigned to one of 8 classes: ciliated columnar epithelial cells (CEC), red blood cells (RBC), squamous epithelial cells (SEC), eosinophil cells (EC), neutrophil cells (NC), lymphocyte cells (LC), macrophages (MC), and background. The initial number of images collected was 1,940, and the total number of cells was 12,267. Since the majority of patients in the intensive care unit are diagnosed with severe infections, samples from these patients have high neutrophil counts and low eosinophil and lymphocyte counts. Moreover, red blood cells are often more densely distributed in clusters than other cells are, resulting in an increased count. The first row in Table 1 presents the number of each cell type, and the second row shows the proportion of each cell type in the total cell number. The dataset is highly diverse. The background color, noise (including impurities), staining color, and the number, size, shape and distribution of the different cells vary greatly across the dataset. The irregular shapes of some of the cells or nuclei and the densely distributed cells also increase the diversity and difficulty of the data.
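Counts and proportions of this kind can be recomputed from the label rows themselves. The following sketch assumes the same id-to-class order as the listing above, which is an assumption rather than something stated by the dataset; the function name and the example rows are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

CLASS_NAMES = ["CEC", "RBC", "SEC", "EC", "NC", "LC", "MC"]  # assumed id order


def class_distribution(label_rows: Iterable[str]) -> Dict[str, Tuple[int, float]]:
    """Map each observed class name to (cell count, proportion of all labeled cells)."""
    counts = Counter(
        CLASS_NAMES[int(row.split()[0])] for row in label_rows if row.strip()
    )
    total = sum(counts.values())
    return {
        name: (counts[name], counts[name] / total)
        for name in CLASS_NAMES
        if counts[name]
    }


# Hypothetical label rows (class id followed by a normalized triangle).
rows = [
    "4 0.1 0.1 0.2 0.1 0.2 0.2",
    "4 0.5 0.5 0.6 0.5 0.6 0.6",
    "1 0.3 0.3 0.4 0.3 0.4 0.4",
    "6 0.7 0.7 0.8 0.7 0.8 0.8",
]
distribution = class_distribution(rows)
print(distribution)
```

In practice the rows would be gathered by reading every label file in the dataset; the per-class proportions then expose the imbalance discussed in the text.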

Table 1 Distribution of different cells in the initial PW-BALFC dataset.


As shown in Table 1, eosinophils accounted for less than 1% of the cells in the original data, and their size and morphology were similar to those of neutrophils. To increase the accuracy of the subsequent detection and segmentation of eosinophils, we performed offline data augmentation of images containing eosinophils. The augmentation consisted of randomly selected brightness, color, contrast, and sharpness adjustments, applied via the ImageEnhance module of the Python Pillow library. Table 2 shows the distributions of different cell types after data augmentation. After augmentation, the total number of images reached 2,105, and the total number of labeled cells was 13,263. The proportion of eosinophils in the total data increased to 2.38%.

Table 2 Distribution of different cell types in the PW-BALFC dataset via offline data augmentation.


Deep learning for BALF cell segmentation

We applied the classical YOLOv8 deep learning instance segmentation model to detect and segment the seven types of BALF cells simultaneously. The labeled dataset was divided into a training set and a test set at a ratio of 9:1. The training set includes the images augmented with color transformations, whereas the eosinophil images in the test set have not undergone color augmentation. The medium-sized YOLOv8-m model was selected in view of the number of images. Online data augmentation strategies, including rotation, flipping, scaling, translation, mixup, copy-paste, and mosaic, were also used to increase the robustness of the network. The segmentation loss function is composed of the bounding box loss, the classification loss, the distribution focal loss (DFL), and the mask segmentation loss. The best model was selected based on the weighted values of precision and recall.
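One way to obtain a 9:1 split whose test set contains no offline-augmented images is sketched below. This is an illustration under stated assumptions, not the authors' splitting script: the file names, the `augmented_from` mapping from each augmented image to its source image, and the extra precaution of dropping augmented copies whose source landed in the test set are all hypothetical choices.

```python
import random
from typing import Dict, List, Sequence, Tuple


def split_dataset(
    originals: Sequence[str],
    augmented_from: Dict[str, str],
    test_fraction: float = 0.1,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Split original images 9:1, then add augmented images to training only."""
    shuffled = sorted(originals)
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_test = round(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]
    held_out = set(test)
    # Skip augmented copies of test images so the test set stays unseen.
    train += [aug for aug, src in sorted(augmented_from.items()) if src not in held_out]
    return train, test


# Hypothetical file names: 1,940 originals and 165 augmented copies.
originals = [f"img_{i:04d}.jpg" for i in range(1940)]
augmented_from = {f"aug_{i:04d}.jpg": f"img_{i:04d}.jpg" for i in range(165)}
train, test = split_dataset(originals, augmented_from)
print(len(train), len(test))
```

Fixing the seed keeps the split reproducible across runs, which matters when comparing models trained on the same data.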

Instance detection and segmentation results

Table 3 shows the specific results of the deep learning YOLOv8 algorithm in our multiclass cell instance detection and segmentation. Four classical segmentation criteria were used to evaluate the performance of the model in each category (ciliated columnar epithelial cells, red blood cells, squamous epithelial cells, eosinophils, neutrophils, lymphocytes, and macrophages). These metrics are precision, recall, Micro-F1 score, and mAP50, which can be used to evaluate the model comprehensively and holistically from multiple perspectives. The micro-F1 score computes global precision and recall by aggregating the prediction results across all categories, and it is more suitable for addressing datasets with imbalanced category distributions. The intersection over union (IoU) threshold was set to 0.68 to select the correct samples for instance detection of the bounding box and instance segmentation of the mask. We also used the confusion matrix to show the prediction results of the model for different cell types. Figure 2 shows the prediction results of this model on the test images.
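The micro-averaged scores and the IoU matching criterion described above can be sketched as follows. The per-class counts are hypothetical, the function names are illustrative, and the way predictions were matched to ground truth in the actual evaluation is not specified here beyond the IoU threshold; the sketch only shows the standard definitions.

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def micro_scores(per_class: Dict[str, Tuple[int, int, int]]) -> Tuple[float, float, float]:
    """Pool (TP, FP, FN) over all classes, then compute precision, recall and F1."""
    tp = sum(c[0] for c in per_class.values())
    fp = sum(c[1] for c in per_class.values())
    fn = sum(c[2] for c in per_class.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


IOU_THRESHOLD = 0.68  # value stated in the text

# A prediction counts as correct only if its IoU with the ground truth is high enough.
is_match = box_iou((0, 0, 10, 10), (2, 0, 12, 10)) >= IOU_THRESHOLD

# Hypothetical (TP, FP, FN) counts for two classes.
precision, recall, f1 = micro_scores({"NC": (84, 8, 22), "EC": (6, 2, 8)})
print(is_match, precision, recall, f1)
```

Because the counts are pooled before the ratios are taken, frequent classes dominate the micro-averaged score, whereas a macro average would weight every class equally.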

Table 3 Detection and segmentation evaluation of different cells in the PW-BALFC dataset.


The model achieved high precision (0.85–1.00) for instance detection and segmentation among all the cell types, with a mean precision of 0.902 across classes. The model achieved good results on squamous epithelial cells (precision 0.934) and eosinophils (precision 1.00). Red blood cells are quite small, and their instance segmentation precision (0.85) was lower than that of the other cell types. Some of the cells in the images were deeply stained with a dark color, some nuclei were clustered and difficult to distinguish, and some cell types exhibited similar sizes and shapes. These hard cases also confused annotators and led to errors in model detection and segmentation (as shown in Fig. 2e). The model may confuse a few red blood cells and lymphocytes, as they are similar in size and deeply stained. The recall of eosinophils was not high, as there were too few of these cells for the model to learn to detect them reliably. The recall of the ciliated columnar epithelial cells was also not high (0.775), as they had highly irregular morphologies compared with those of other cells, and sometimes their cilia were difficult to visualize, giving them an appearance similar to macrophages and lymphocytes. The recall values of the other cell types were high, ranging from 0.864 to 0.922; the average recall over all seven cell types was 0.828. The Micro-F1 score, which is a comprehensive index that balances precision and recall, reached 0.801 for the seven cell types. Excluding eosinophils and ciliated columnar epithelial cells, the average Micro-F1 score for the remaining five cell types was above 0.87. The mAP50 metric, which is another comprehensive metric used to assess the instance detection model, reflects the performance of the model at different confidence thresholds. Our model achieved an average mAP50 of 0.884 for the seven cell types. Excluding eosinophils, which had a very low cell count proportion, the mAP50 values for the remaining cell types were all above 0.9, ranging from 0.905 to 0.958. The confusion matrix of our model on the test dataset is presented in Fig. 2d.

Overall, the model’s general accuracy in instance detection and segmentation prediction for seven types of BALF cell data is commendable. The observed misclassification of some cells also indicates that the model needs further refinement. Strategies including more data augmentation, hyperparameter adjustment of the model, and postprocessing operations such as handling edge cases where cells are difficult to distinguish and performing secondary classification on cells prone to misidentification could further increase the accuracy of the model.

Technical Validation

To the best of our knowledge, this study presents the first publicly available clinical dataset¹⁴ that provides pixel-level instance annotations for seven cell types in high-resolution BALF images. Unlike previous studies that focused on classifying and counting four BALF cell types, our dataset¹⁴ also contains ciliated columnar epithelial cells, red blood cells, and squamous epithelial cells, whose proportions in a sample can be used to determine whether the sample is qualified. The dataset thereby facilitates the automated screening of qualified samples, which helps prevent erroneous laboratory diagnoses caused by analyzing suboptimal BALF samples, reduces the risk of misdiagnosis or missed diagnosis, and minimizes resource waste.

Owing to the variability of manual staining and smearing, as well as the influence of different patients’ disease states, the image backgrounds and the morphology of cells of the same type are diverse, with different shades of staining, overlapping of multiple cells, and different morphologies of the same cell type. Zhu et al. reported that SARS-CoV-2-infected ciliated cells shed their cilia^15,16. Previous studies have shown that viral infections can trigger cytopathic effects (CPEs), characterized by cellular swelling, cytoplasmic vacuolization, degeneration and disintegration of ciliated cells, and nuclear chromatin marginalization. In addition, patients with prolonged exposure to dust and smoke exhibit macrophages containing thickened, increased, and unevenly distributed granules. In cases of lipoid pneumonia, macrophages often display a foamy cytoplasmic appearance. Additionally, binucleated and multinucleated macrophages are frequently observed in patients with sarcoidosis and tuberculosis¹⁷. In our dataset¹⁴, the implementation of contour fine labeling supports precise morphological analysis of BALF cells, allowing cells to be distinguished by size and accurately identified based on characteristic cell features. This approach, which can better accommodate the complex and variable annotation requirements of real-world clinical samples, enhances diagnostic precision and improves the detection of cells that exhibit irregular and diverse morphological variations. Moreover, comparing the distinct morphological characteristics of cells of the same type may help identify the disease status of a patient.

Eosinophils account for a small percentage of BALF cells and exhibit morphological similarities to neutrophils. To address the extreme category imbalance and improve the accuracy of eosinophil segmentation and detection, an offline data augmentation strategy was applied to the eosinophil images in this dataset before deep learning training. Online augmentations, including rotation, flipping, scaling, translation, and mixup, were then applied to the images of all cell categories during training of the YOLO model to increase the robustness of the model.

Annotation validation

The annotation process was carried out via a two-round, two-level labeling approach. After initial annotation through cross-labeling between two experienced annotators, the annotations were reviewed and refined by two senior clinical cytologists. In the event of any disagreement regarding the cell annotation results between the initial annotators and the clinical cytologist with over 2 years of experience, the clinical cytologist with over 15 years of experience determined the final annotation. In addition, patient-identifiable information, including name, admission diagnosis, and clinical condition, was anonymized prior to data labeling. Therefore, both annotators and clinical cytologists judged each image on its own, which helped prevent erroneous annotations caused by subjective judgment. This annotation procedure was repeated twice to ensure a high degree of accuracy.

The confusion matrix shown in Fig. 2d presents the prediction results for the identification of different cell types and indicates that the proportions of true data predicted to fall into the correct category were all above 0.83, with the exception of eosinophils. The misclassifications fell into two main groups: (1) a few errors in distinguishing neutrophils from eosinophils and (2) some small, clustered red blood cells that were confused with the background. These errors were caused by similar cell morphologies and deeply stained slides, which also present challenges for experienced clinical cytologists. Based on the above, the number of inconsistent labels in our annotated dataset is minimal, which is acceptable considering the complexity of the clinical data and the variability within the samples.

Influence of the augmentation process

Offline data augmentation increased the proportion of eosinophils among all labeled cells from 0.69% to 2.38%, which helped to increase the robustness of the deep learning model. YOLOv8, which performs instance detection and segmentation, was then used for deep learning training and testing on the cell data. The annotated dataset was split into training and test datasets at a ratio of 9:1; the model was trained on the augmented dataset, while the test set excluded the offline-augmented images. The instance detection and segmentation results described in the previous section demonstrate that the network model trained in this paper on the augmented dataset identifies the vast majority of cells in BALF and classifies them accurately.

Usage Notes

This dataset¹⁴ can be used to train and explore different AI models for the automatic detection and morphological analysis of BALF cells, enabling the rapid diagnosis of various lung diseases. The dataset is particularly suitable for the rapid screening of pulmonary infectious diseases and promotes timely targeted therapy.

Limitations

The dataset¹⁴ focuses on pulmonary infections, which may affect its characteristics (including neutrophilia >50%, nuclear left shift and hypersegmented nuclei^18,19) and reduce accuracy in automatically identifying nondominant cell types (such as eosinophils). Acquiring the dataset requires strict specimen processing timelines within the overall clinical workflow (BALF fixation/staining within hours to prevent morphological degradation¹⁹). We followed a standardized specimen collection process to prevent changes in cell morphology, such as apoptosis or lysis, caused by delayed sample processing. Our dataset also includes noninfectious lung diseases, such as interstitial lung diseases, lung tumors, and allergic pneumonia, and uses contour fine labeling; both of these help mitigate bias and enable AI models to make robust predictions.

Code availability

The complete code for our study is publicly available on GitHub at https://github.com/shixin0927/Clinical-Dataset-Of-Bronchoalveolar-Lavage-Fluid-Cell/tree/master. This repository contains the code of the instance detection and segmentation network, which can be directly used and explored by other researchers. The repository also contains the code for training and evaluating our neural network.

Change history

18 July 2025
In this article the Supplementary File was published in error and has now been removed. The original article has been corrected.

References

Davidson, K. R., Ha, D. M., Schwarz, M. I. & Chan, E. D. Bronchoalveolar lavage as a diagnostic procedure: a review of known cellular and molecular findings in various lung diseases. J Thorac Dis 12, 4991–5019, https://doi.org/10.21037/jtd-20-651 (2020).
Article PubMed PubMed Central Google Scholar
Sindhu, A., Jadhav, U., Ghewade, B., Wagh, P. & Yadav, P. Unveiling the Diagnostic Potential: A Comprehensive Review of Bronchoalveolar Lavage in Interstitial Lung Disease. Cureus 16, e52793, https://doi.org/10.7759/cureus.52793 (2024).
Article PubMed PubMed Central Google Scholar
Zhou, D. et al. Consensus of Chinese Experts on Morphological Examination of Bronchoalveolar Lavage Fluid Cells (2020). J Mod Lab MeD 35, 4–8 (2020).
Google Scholar
Wang, X. et al. A Clinical Bacterial Dataset for Deep Learning in Microbiological Rapid On-Site Evaluation. Sci Data 11, 608, https://doi.org/10.1038/s41597-024-03370-5 (2024).
Article PubMed PubMed Central Google Scholar
Craven, V., Hausdorff, W. P. & Everard, M. L. High levels of inherent variability in microbiological assessment of bronchoalveolar lavage samples from children with persistent bacterial bronchitis and healthy controls. Pediatr Pulmonol 55, 3209–3214, https://doi.org/10.1002/ppul.25067 (2020).
Article PubMed Google Scholar
Stanzel, F. in Principles and Practice of Interventional Pulmonology (eds. Ernst, A. & Herth, F. J.) Ch. 16, 165–176 (Springer New York, 2013).
Baughman, R. P. Technical aspects of bronchoalveolar lavage: recommendations for a standard procedure. Semin Respir Crit Care Med 28, 475–485, https://doi.org/10.1055/s-2007-991520 (2007).
Article PubMed Google Scholar
Baughman, R. P., Spencer, R. E., Kleykamp, B. O., Rashkin, M. C. & Douthit, M. M. Ventilator associated pneumonia: quality of nonbronchoscopic bronchoalveolar lavage sample affects diagnostic yield. Eur Respir J 16, 1152–1157, https://doi.org/10.1034/j.1399-3003.2000.16f23.x (2000).
Article CAS PubMed Google Scholar
Rumpf, S., Zufall, N., Rumpf, F. & Gschwendtner, A. A Performance Comparison of Different YOLOv7 Networks for High-Accuracy Cell Classification in Bronchoalveolar Lavage Fluid Utilising the Adam Optimiser and Label Smoothing. J Imaging Inform Med https://doi.org/10.1007/s10278-024-01315-3 (2024).
Article PubMed Google Scholar
Tao, Y. et al. Automated interpretation and analysis of bronchoalveolar lavage fluid. Int J Med Inform 157, 104638, https://doi.org/10.1016/j.ijmedinf.2021.104638 (2022).
Article PubMed Google Scholar
Wu, P. et al. An improved Yolov5s based on transformer backbone network for detection and classification of bronchoalveolar lavage cells. Comput Struct Biotechnol J 21, 2985–3001, https://doi.org/10.1016/j.csbj.2023.05.008 (2023).
van Huizen, L. M. G. et al. Leukocyte differentiation in bronchoalveolar lavage fluids using higher harmonic generation microscopy and deep learning. PLoS One 18, e0279525, https://doi.org/10.1371/journal.pone.0279525 (2023).
Weng, X., Sun, W., Luo, Z., Zhou, Y. & An, X. Microbiological Rapid On-Site Evaluation for Pulmonary Infectious Diseases. J Vis Exp https://doi.org/10.3791/66059 (2024).
Shi, X. & Huang, Q. Clinical Dataset Of Bronchoalveolar Lavage Fluid Cell. Zenodo. https://doi.org/10.5281/zenodo.14871206 (2025).
Bridges, J. P., Vladar, E. K., Huang, H. & Mason, R. J. Respiratory epithelial cell responses to SARS-CoV-2 in COVID-19. Thorax 77, 203–209, https://doi.org/10.1136/thoraxjnl-2021-217561 (2022).
Zhu, N. et al. Morphogenesis and cytopathic effect of SARS-CoV-2 infection in human airway epithelial cells. Nat Commun 11, 3910, https://doi.org/10.1038/s41467-020-17796-z (2020).
Zhou, D., Tang, G. & Liu, S. Cytology Atlas and Laboratory Diagnostic Cases of Bronchoalveolar Lavage Fluid. (Shanghai Science and Technology Press, 2022).
Zhang, W. et al. The clinical value of hematological neutrophil and monocyte parameters in the diagnosis and identification of sepsis. Ann Transl Med 9, 1680, https://doi.org/10.21037/atm-21-5639 (2021).
Meyer, K. C. et al. An official American Thoracic Society clinical practice guideline: the clinical utility of bronchoalveolar lavage cellular analysis in interstitial lung disease. Am J Respir Crit Care Med 185, 1004–1014, https://doi.org/10.1164/rccm.201202-0320ST (2012).

Acknowledgements

This research was supported by the Capitals Funds for Health Improvement and Research (CFH) under Grant 2022-1-5091 and project N20240194.

Author information

These authors contributed equally: Xin Shi, Qing Huang.

Authors and Affiliations

College of Pulmonary and Critical Care Medicine, Chinese PLA General Hospital, Beijing, China
Xin Shi, Xiuli Wang, Yinghan Shi, Ye Hu, Zhimei Duan, Fei Xie, Sifan Li, Lixin Xie & Kaifei Wang
Chinese PLA Medical School, Beijing, China
Xin Shi
School of Computer Science & Engineering Artificial Intelligence, Hubei Key Laboratory of Intelligent Robotics, Wuhan Institute of Technology, Wuhan, Hubei, 430205, China
Qing Huang, Teng Xu & Hongwen Mei
Britton Chance Center for Biomedical Photonics, Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
Tingwei Quan
School of Medicine, Nankai University, Tianjin, 300071, China
Xiuli Wang & Sifan Li

Authors

Xin Shi
Qing Huang
Teng Xu
Hongwen Mei
Tingwei Quan
Xiuli Wang
Yinghan Shi
Ye Hu
Zhimei Duan
Fei Xie
Sifan Li
Lixin Xie
Kaifei Wang

Contributions

X. Shi developed the research plan, acquired all images, verified the data annotation, analyzed the results, and drafted the manuscript. Q. Huang verified the data annotation, processed and analyzed the data, developed the deep learning network, discussed the results, and drafted the manuscript. T. Xu, H. Mei, and T. Quan processed and annotated the data. X. Wang, Y. Shi, Y. Hu, Z. Duan, F. Xie, and S. Li collected the clinical samples. K. Wang and L. Xie designed the research, supervised the project, verified the data annotation, and reviewed the manuscript.

Corresponding authors

Correspondence to Lixin Xie or Kaifei Wang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Shi, X., Huang, Q., Xu, T. et al. PW-BALFC, a clinical dataset for detection and instance segmentation of bronchoalveolar lavage fluid cell. Sci Data 12, 1074 (2025). https://doi.org/10.1038/s41597-025-05452-4

Received: 26 February 2025
Accepted: 24 June 2025
Published: 01 July 2025
Version of record: 01 July 2025
DOI: https://doi.org/10.1038/s41597-025-05452-4