Table 1. Statistical comparison of existing datasets and our Broncho-R dataset, including dataset name, data source, number of samples, and the sub-tasks covered.

From: Towards Automated Reporting: A Bronchoscopy Report Dataset for Enhancing Multimodality Large Language Models

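| Dataset Name | Data Source | Patient Numbers | Images | Sub-Task |
| --- | --- | --- | --- | --- |
| BronchoLC [11] | Public | 208 | 2,921 | CLS, SEG |
| UAAL [12] | Public | | 3,814 | CLS, SEG |
| B12K [14] | Public | 615 | 2,900 | CLS |
| PKDN [13] | Private | 200 | 2,029 | CLS |
| Ours | Public | 3,692 | 6,330 | CAP, CLS |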

  1. In general, the main shortcomings of existing datasets are the following: (i) they lack comprehensive, evenly distributed departmental coverage and a sufficient number of patients to prevent bias; (ii) their data sources are often private and inaccessible; (iii) they usually provide annotations for only a single task. “CLS” refers to the classification task, “SEG” to the segmentation task, and “CAP” to the captioning task.