Introduction

Artificial intelligence (AI) tools in medicine aim to improve patient care by learning patterns across massive, well-annotated datasets. These annotations, describing key patient characteristics, are often derived from clinical notes, such as pathology and radiology reports. Extracting ground-truth labels from clinical notes remains a critical research bottleneck. As a result, our study focuses on research database curation, namely the systematic process of turning massive collections of clinical notes into clean tabular datasets. The traditional approach to database curation is manual chart review, where researchers annotate thousands of reports. If the research team wishes to add a new variable to their database, a new round of exhaustive annotation is required. It is intractable to curate health-system scale research databases with retrospective manual annotation. There is an unmet need for scalable and human-level data extraction from clinical reports with minimal engineering overhead.

Automated tools, including deep learning1,2, statistical machine learning3,4,5,6, and rule-based7,8,9 methods, have been proposed to extract structured information from clinical reports. While these systems have achieved promising performance, they require substantial effort to develop and deploy. As a result, their adoption by the research community remains limited. Large language models (LLMs) have demonstrated exciting capabilities in general text understanding, detecting errors in radiology reports10, streamlining radiology report impressions11,12, and automatic synoptic reports13. Recent works14,15 present powerful methods for data-scarce few-shot learning, particularly in relation extraction and named entity recognition tasks. However, they require non-trivial model customization and pretraining strategies. Ideally, researchers could structure information extraction as a question-answering task using a shared, minimal-engineering setup. Commercial LLMs offer promising capabilities, yet they are ill-suited for health-system-scale research database curation due to reproducibility, cost, and privacy concerns. Commercial LLMs are continually updated, and if new versions of the LLM perform unexpectedly worse given the same inputs, it can be infeasible to reproduce prior extractions after model updates. Furthermore, costs associated with using commercial LLMs can pose significant barriers, and institutional agreements are often required to address privacy concerns. Ideally, researchers could locally host their LLMs (“self-hosting”) on cheap “desktop-grade” hardware; these local LLMs would enable reproducible research while minimizing costs and avoiding the transfer of sensitive data.

Open-source LLMs, such as Llama16,17,18 and Mistral19, offer a promising solution for efficient “self-hosted” information extraction. Small LLM variants (i.e., < 11 billion parameters) with instruction-tuning allow for efficient general text understanding and cheap customization. Recent medically fine-tuned LLMs, including PMC-Llama 13B and Lama-3-8B-UltraMedical, have demonstrated compelling performance on biomedical question-answering benchmarks20,21. Small versions of the DeepSeek-R1 model have demonstrated strong general reasoning performance while building on small-scale general LLM backbones22, Llama-8B and Qwen-2.5-7B. While these LLMs offer exciting performance across general benchmarks, it remains unclear how these tools perform in clinical information extraction.

In this study, we focus on empowering individual researchers to curate large-scale research databases at their institution, using minimal computing (i.e., a few hours on a single GPU) and annotation resources (i.e., < 100 reports annotated). Our strategy aligns with the practical constraints and foundational research workflows of academic medical centers. Following this strategy, we do not develop massive 400B + open-source LLMs that require multiple H100 servers to host or generic universal LLMs that can generalize across institutions. We hypothesized that with simple fine-tuning techniques, small-scale open-source LLMs could enable individual researchers to create high-quality datasets customized to their institution, empowering their downstream AI research. Our extraction quality goal is “human-level performance”; specifically, we test the ability of LLMs to achieve statistically non-inferior performance to a medical student extracting the same information from clinical notes. The comparison to medical students is not meant to suggest equivalence to specialist performance, but rather to assess whether LLMs can match the standard level of accuracy currently used to curate large-scale research databases. We benchmark a wide range of “self-hosted” LLM approaches on four diverse real-world datasets from ongoing AI research projects at our institutions. To facilitate the broader adoption of customized open-source LLMs, we release a low-code library, Strata, that supports the fine-tuning, evaluation, and deployment of LLMs for clinical information extraction (Fig. 1).

Fig. 1
Fig. 1
Full size image

Dataset construction flowchart for our breast (a), kidney (b), prostate (c), and MDS (d) datasets. UCSF = University of California, San Francisco, VUMC = Vanderbilt University Medical Center, EHR = electronic health records, MDS = Myelodysplastic Syndrome.

Materials and methods

We collected four diverse clinical text datasets underlying active research projects at our institutions across breast, kidney, prostate, and bone marrow cancer (Myelodysplastic Syndrome, MDS). These real-world datasets represent our intended use cases, where LLM-based tools empower the automatic creation of large-scale structured datasets to support downstream research at a single medical center. Our breast, kidney, and prostate datasets were annotated by medical students, and our MDS dataset was annotated by a hematologist. The medical students were specifically trained for the task by attending physicians with 8 to 11 years of post-graduate clinical experience. We describe each dataset in detail, including its creation and annotation process, in Appendix A1 of our supplementary methods.

Extracting structured information using customized large Language models

State-of-the-art LLMs are trained to follow instructions specified in text prompts when generating their answers. This process, called instruction-tuning23, enables users to define tasks with instruction prompts. To leverage instruction-tuned LLMs for clinical information extraction, a user can write out the desired task (e.g., extract the cancer subtype from a pathology report) in a text prompt before inputting the full clinical report. Given the prompt and the clinical report, the LLM will generate an answer in the text format, which can be parsed into the desired structured format using a simple post-processing script. This process is illustrated in Fig. 2.

Fig. 2
Fig. 2
Full size image

Process for leveraging open-source LLMs to extract structured information from clinical reports. The input to the LLM consists of a user prompt, which specifies the task (e.g., cancer subtyping), paired with the report text. An open-source LLM contains pre-trained weights (e.g., 8B parameters for Lamma-3.1 8B) learned across web-scale datasets; when fine-tuning an LLM, we introduce additional LoRA weights (~ 400 M parameters) to customize the model. The model generates an answer in text, parsed by a user-defined “parse function” into a tabular form.

In this work, we are interested in customizing local LLMs for clinical information extraction. To enable the full research community to benefit from this resource, we focus on LLMs that can be easily hosted on a single consumer desktop-grade GPU, such as the Nvidia RTX 3090 or 4090. As a result, we leverage three classes of small LLMs: (1) medically fine-tuned QA models, including PMC-LLaMA 13B and Llama-3-8B-UltraMedical, (2) accessible versions of leading reasoning models, including DeepSeek R1 distilled to Llama-3.1 8B and Qwen2.5-Math 7B, and (3) popular base instruction-tuned models, such as Llama-3.1 8B, Gemma-2-9B, and Mistral-v0.3 7B. These billion-parameter LLMs are orders of magnitude smaller than state-of-the-art LLMs such as Llama-3.1 405B and GPT-4. To enable smaller LLMs to achieve human-level performance in clinical information extraction, we fine-tune them. Specifically, we leverage simple preprocessing scripts to convert human-annotated structured labels (i.e., in a tabular format) to text answers for the LLM to generate (e.g., “My answer is: {“DCIS”: “yes”, “ILC”: “no”}”). Given a dataset of pairs of clinical reports and answer texts, we can train the LLM to correctly predict the intended answer from the user prompt and clinical report. This process is illustrated in Fig. 3. To fine-tune LLMs using desktop-grade hardware, we employ Low-Rank Adaption (LoRA)24. Instead of directly optimizing the billions of parameters in our LLMs, LoRA keeps the initial weights of the LLM frozen and adds a small number of new parameters (e.g., < 400 M parameters) across the LLM to customize the model’s behavior.

Fig. 3
Fig. 3
Full size image

Process of using Strata to create research databases. The researcher labels a subset of the reports, defines prompts, and creates parsing functions to interpret the model’s response. Given this information, Strata supports LLM fine-tuning and deployment. In LoRA fine-tuning, the model’s LoRA weights are trained to generate the templated answers. Strata supports model evaluation and efficient bulk-inference. Bulk-inference is used to create a research database from all unlabeled reports at the institution.

To enable the broader clinical and research communities to fine-tune and deploy their LLMs on local hardware, we developed Strata, a lightweight library for clinical information extraction. As illustrated in Fig. 3, Strata is used in three phases. First, researchers annotate a small dataset of clinical reports with their desired structured labels, which are partitioned into a training, development, and testing set. In addition to these annotations, researchers define the task for the LLM by specifying a prompt, a parsing script, and a template function. As described above, the parsing script maps the LLM’s answer to the desired structured format, and the template function maps the structured annotation to a text-based answer. Given an annotated dataset, Strata supports LoRA-based fine-tuning and flexible hyperparameter optimization. Finally, Strata offers a rich automated evaluation suite for model selection. We leverage Strata across all our experiments.

LLM development and evaluation

Across all datasets, we developed prompts, parsing scripts, and template functions to maximize the performance of our open-source LLMs on the development sets. In these experiments, the models were used without fine-tuning (i.e. zero-shot). For the DeepSeek-R1 models, we modified the prompts to allow for additional reasoning preceding the final answer, as shown in Supplementary Table S1. For evaluation, we ensure that responses are complete in Supplementary Table S2, then explicitly excluded the reasoning content between the “Think:” and “Answer:” tags and used only the final output in our accuracy calculations. This ensured that DeepSeek was evaluated on the same basis as other models and avoided penalizing models for verbosity or non-standard formatting in intermediate steps. Given a final set of prompts and scripts, we used LoRA to fine-tune our best-performing non-reasoning model, Llama-3.1. We show detailed examples of templates and responses in Appendix A2.

LLM fine-tuning requires selecting multiple hyperparameters, which leads to a variety of design choices on model architecture, data augmentation, and training recipes. LoRA includes two hyperparameters, namely rank and alpha. LoRA rank determines the size of the trainable parameters in fine-tuning, and alpha scales the output of these trainable layers relative to the original model’s outputs. In the context of the MDS dataset, we also experimented with class balancing, where rare values are oversampled during training. Finally, we also tuned the number of training epochs and the learning rate of the optimizer. Using Strata, we performed a grid search over possible hyperparameters in mixed-dataset fine-tuning on the development set. We applied the optimal hyperparameters in all other settings.

We considered three possible variants of LLM fine-tuning. First, we trained a customized LLM for each task within each dataset separately (i.e., question-specific fine-tuning). Next, we explored dataset-specific fine-tuning, in which all tasks within a single dataset (e.g., specimen site, cancer presence, and subtyping) were included in fine-tuning a model. Finally, we also considered training a single fine-tuned LLM across all datasets and tasks (i.e., mixed-dataset fine-tuning).

We leverage the annotations of our first human annotator as our reference labels in all evaluations. Our primary evaluation metric is “exact-match accuracy,” where we evaluate the ability of the LLM to extract all components of user-specified queries correctly. For instance, if the reference (i.e., first human annotator) response for breast cancer subtyping were “{DCIS: Yes, ILC: No, IDC: Yes},” the LLM predictions would have to match across all fields to gain a point in our cancer subtyping exact-match accuracy calculation. For dataset-level exact match accuracies, the LLM must correctly extract the information across all user queries for that clinical report. In addition to our stringent exact match accuracies, we also evaluate average accuracies and macro-F1, precision, and recall scores in our Supplementary Tables S3, S4, and S5.

We compare our open-source LLM results to GPT-4 and a second human annotator. All LLMs and the second human annotator were given the same prompts, and their extractions were compared to the first human annotator extractions as reference labels, using the metrics described above. The accuracy of our second human annotator serves as our benchmark for “human-level performance”. We hypothesize that fine-tuned LLMs would be non-inferior to the second human annotator in clinical information extraction. Our goal is not to outperform the second human annotator in matching the first human annotator; we expect disagreements between human annotators to reflect ambiguity in clinical reports. The second human annotator for each test set was a medical student who received training on the relevant pathology classifications and the specific criteria for annotation for each dataset. Both annotators receive appropriate training, so each of their performance individually is on par with what we would expect in real-world database curation. We compare the labels of the two readers using Cohen’s kappa coefficient and inter-annotator agreement rates.

To understand the dependence of fine-tuned LLM performance on labeled data, we performed learning curve analyses. We randomly sampled subsets of each training set (consisting of 25% and 50% of the full training data) and fine-tuned using only the subsets at the dataset-specific level. We then evaluated the model performance with different subsets against the second human annotator to determine the number of labeled reports needed for human-level performance.

To understand the effect of hyper-parameters on our performance, we performed a sensitivity analysis. For each hyperparameter (i.e., LoRA rank, LoRA scaling, number of epochs, and learning rate), we varied one hyperparameter value while holding the others constant. For all fine-tuning experiments, we used a linear learning rate scheduler with a warmup of 5 steps. The learning rate increases during the warmup phase and then decays linearly over the remainder of training. We did not perform additional tuning of the learning rate schedule. This supports our broader aim of evaluating how well models perform under minimal configuration. In particular, this setup offers a robust and practical starting point for low-code workflows, where researchers can fine-tune models with minimal effort and domain-specific engineering. We evaluated the difference in exact match accuracy.

Statistical analysis

We used scikit-learn (version 1.5.1; scikit-learn.org) and SciPy (version 1.14.1; scipy.org) to compute all evaluation metrics, including F1-score, precision, and recall. We computed evaluation metrics across 5000 bootstrap samples (16) to obtain 95% confidence intervals (CIs). To compare the exact match accuracy of LLMs to the accuracy of a second human annotator, we utilized a non-inferiority t-test using the stats R package (version 4.4.1; cran.r-project.org). Given the size of each test set, 80% desired power, a 0.05 p-value, and the standard deviation of the second human annotator accuracy, we computed the margin of error measurable by each test set. This power calculation was performed using the pwr R package (version 1.3-0; cran.r-project.org), and we utilized dataset-specific margins in each non-inferiority calculation. Given 294, 130, 72, and 218 reports in the breast, prostate, kidney, and MDS test sets, this resulted in non-inferiority margins of 6.3%, 8.2%, 13.5%, and 5.9%, respectively.

Ethical compliance

This study was approved by the University of California, San Francisco (UCSF) Institutional Review Board. The requirement for written informed consent was waived by the UCSF Institutional Review Board. All experiments were conducted in accordance with the relevant guidelines and regulations.

Results

Report characteristics

Our breast, kidney, prostate, and MDS datasets contained 978, 238, 431, and 724 reports (Fig. 1). Of these reports, 294, 72, 130, and 218 were randomly selected for held-out testing for breast, kidney, prostate, and MDS extractions, and these reports contained an average of 766, 688, 423, and 785 words, respectively. Amongst breast, kidney, and prostate test sets, 65%, 94%, and 85% respectively contained a confirmation of cancer. 7% of the MDS test set confirmed untreated MDS. We report detailed text and label statistics for all four datasets in Table 1.

Table 1 Detailed outcome and text characteristics for each Dataset.

Inter-annotator analysis

We report our prompts for each dataset and task in Table 2 and a detailed model performance evaluation in Table 3. Our primary performance metric, exact match accuracy, evaluates the ability of a model to correctly extract all fields for a report, using the labels of the first human annotator as the reference. Our second human annotators obtained exact match accuracies of 75.5 (95% CI 70.1, 80.3), 83.3 (95% CI 73.6, 90.3), 83.1 (95% CI 76.2, 88.5), 85.8 (95% CI 80.7, 89.9) in the breast, kidney, prostate, and MDS test set, respectively. These correspond to average Cohen’s kappa coefficient scores of 0.87, 0.78, 0.88, and 0.54 for each dataset and inter-annotator agreement among 97.3%, 94.0%, 93.3%, and 92.4% of fields. This extraction-level error rate aligns with prior studies on medical record abstraction, which report error rates ranging from 0.7% to 20.8%25. The inter-annotator agreement sources of disagreement are reported in Supplementary Figure S1.

Table 2 Task-specific LLM prompts for each Dataset.
Table 3 Dataset-Level exact match accuracies for the second human Annotator, GPT-4 and Open-source LLMs.

Our manual review revealed that on average across all datasets, 48.7% of disagreements stemmed from human error and 39.4% from inherent ambiguity in the pathology reports. In the breast dataset, many human errors were due to omission of the lymph node in the specimen source, which are often buried within long reports describing multiple breast samples. In the kidney dataset, left-right labeling errors were common. In the MDS dataset, many errors seemed to stem from the complexity of the reports. Annotators have to extract multiple pieces of information from long, detailed narratives. More detailed metrics are shown in Supplementary Fig. S2.

LLM performance

We compare model accuracies to the accuracies of independent second human annotators using non-inferiority tests. We consider a model to have obtained “human-level performance” if it meets the accuracy of our second human annotator. When leveraging dataset-specific fine-tuning, Llama-3.1 8B obtained exact-match accuracies of 91.5 (95% CI 88.1, 94.6), 87.5 (95% CI 77.8, 94.4), 91.5 (95% CI 86.2, 95.4), 89.4 (95% CI 84.9, 93.1) in the breast, kidney, prostate, and MDS test set, respectively. This model obtained higher accuracies than the second human annotator in all four datasets, and the result was statistically non-inferior in all test sets (p < 0.001). These results are summarized in Fig. 4, and we report additional detailed metrics by data field in Table 4. We report additional metrics, including average accuracies and precision, recall, and F1 scores, in Supplementary Tables S3, S4, and S5.

Fig. 4
Fig. 4
Full size image

Summarized dataset-level results. Bar graph shows exact match accuracy for each dataset. Selected models include the second human annotator, zero-shot GPT-4, and fine-tuned and zero-shot Llama 3.1 8B. The human, snowflake, and flame icons represent the human, zero-shot, and fine-tuned models respectively.

Table 4 Detailed comparison of fine-tuned Llama 3.1 to second human Annotator.

Without fine-tuning (i.e. zero-shot), open-source LLMs obtain significantly worse performance than our second human annotator, who had an average exact match accuracy of 81.9%. DeepSeek-R1-Distill-Llama-8B obtains the highest overall performance, with an average exact match accuracy of 56.8% and 72.4 (95% CI 67.0, 77.6), 66.7 (95% CI 55.6, 76.4), 80.8 (95% CI 73.1, 86.9), and 7.3 (95% CI 4.1, 11.5) on breast, kidney, prostate and MDS, respectively. Llama-3.1 8B obtained similar zero-shot performance averaging 55.8 over the four datasets, and 42.9 (95% CI 37.1, 48.6), 75.0 (95% CI 63.9, 84.7), 28.5 (95% CI 20.8, 36.9), and 76.6 (95% CI 70.6, 82.1) for each dataset. Critically, Llama-3.1 8B obtained this performance without the use of any intermediate reasoning steps and, hence, was readily amenable to further fine-tuning. Gemma-2-9B and Mistral-v0.3 7B both obtained worse performance than Llama-3.1 8B, with average exact match accuracies of 52.1% and 48.6%, respectively. We found that LLama-3-8B-UltraMedical and PMC-LLaMA 13B both performed worse than Llama 3.1 8B, with 39.1 and 16.7 average exact match across datasets.

Llama-3.1 8B obtained qualitatively similar results across fine-tuning settings, obtaining average exact match accuracies ranging from 85.5% to 90.0% compared to 81.9% by second human annotators. These results suggest that researchers can obtain human-level results if they annotate data for a single task (i.e., question-specific finetuning); alternatively, institutions can train a single model to support all tasks (i.e., mixed-dataset finetuning) to obtain human-level results. Dataset-specific fine-tuned Llama-3.1 8B significantly outperformed its zero-shot counterparts across all datasets (p < 0.05), with high average precision and recall of 93.9 and 91.0. We visualize model predictions for clinical tasks in Supplementary Figure S2. GPT-4, which received no fine-tuning, obtained an average exact match accuracy of 86.1% across the four datasets. GPT-4 was non-inferior (p < 0.001) to the second human reader in breast, prostate, and MDS test sets. GPT-4 did not obtain non-inferior accuracy relative to the second human annotator (80.6% vs. 83.3%, p = 0.09) in the kidney test set.

Learning curve analysis and hyperparameter sensitivity

To understand the minimum number of training samples required to achieve human-level performance, we retrained our dataset-specific fine-tuned Llama-3.1 8B model with subsets of our training set (Fig. 5). Across each test set, we report the performance of our second human annotator on our dotted line. Our fine-tuned Llama-3.1 8B matched human performance when leveraging only 100, 95, 43, and 72 training reports for the breast, kidney, prostate, and MDS test sets, respectively. Each dataset-level model took 1–3 h to train using a single A40 GPU. At the time of writing, these experiments would cost between $0.80 to $2.40 on a private cloud. Next, we assessed the sensitivity of our models to the optimal choice of training hyper-parameters, including learning rate, number of training epochs, LoRa scaling (i.e., alpha), and LoRa rank (Fig. 6). We found that our fine-tuning results are sensitive to the choice of learning rate; while leveraging a learning rate of 1e-4 obtains a dataset-average exact-match accuracy of 89.3%, learning rates of 1e-3 and 1e-5 obtain degraded accuracies of 5.0% and 80.3%, respectively.

Fig. 5
Fig. 5
Full size image

Learning curves analysis across all four datasets. Blue line graphs show the exact match accuracy for dataset-specific finetuned Llama 3.1 when leveraging 0%, 25%, 50%, and 100% of the training data. The dotted line displays the performance of the second human annotator. The stars show the number of reports used to match performance of the second human annotator.

Fig. 6
Fig. 6
Full size image

Hyperparameter sensitivity analysis for mixed-dataset fine-tuned Llama 3.1 8B. Heat maps display the average dataset-level exact match accuracy across datasets, across a range of values for each hyper-parameter. Llama 3.1 is robust to the choice of training epochs, LoRA (low-rank adaptation) scaling, and LoRA rank, but the choice of learning rate can significantly alter performance.

Discussion

We developed Strata, a lightweight open-source library for LLM-based clinical information extraction. Strata supports a wide range of LLMs, prompt development, parameter-efficient fine-tuning, model evaluation, and deployment. Using Strata and minimal compute resources, namely a single commodity GPU (Nvidia A40) per experiment, we demonstrated that fine-tuned open-source LLMs achieve human-level performance in diverse, real-world clinical research applications across prostate MRI, breast pathology, kidney pathology, and myelodysplastic syndrome pathology reports. Our fine-tuned Llama-3.1 8B model was non-inferior to a second human annotator in all test sets. Moreover, Llama-3.1 8B demonstrated human performance leveraging only 100, 95, 43, and 72 training reports in our breast, kidney, prostate, and MDS test sets. Under the guidance of expert clinicians, Strata now supports multiple LLM-backed research databases at our institutions. This deployment reflects not just technical performance, but also clinician acceptance of the tool’s outputs in real research workflows. We demonstrate that locally hosted fine-tuned LLMs offer a practical solution for clinical information extraction, requiring limited computational and human effort; our “low-code” open-source library is designed to empower individual researchers to curate high-quality (i.e., human-level) databases at their instruction, supporting their downstream AI research.

Much of the prior work on using LLMs for clinical information extraction has focused on commercial models such as ChatGPT26. In our study, we found that GPT-4 demonstrated human-level zero-shot performance in three out of four test sets, namely breast, prostate, and MDS, and it was not non-inferior to our second annotator in the kidney dataset. Relying on commercial models to develop research databases poses several challenges. Using commercial models for protected health information requires institutional agreements to address privacy concerns, creating a significant access barrier. Moreover, rigorous research practices require reproducibility; as commercial LLMs are updated without user oversight, researchers cannot easily reproduce prior model queries or ensure consistent extraction quality. In contrast, our small open-source LLMs can be locally hosted, version-controlled, and easily customized for new applications.

While there has been rapid progress in adapting open-source LLMs to medical20,21 and general reasoning tasks22, we found that both biomedically fine-tuned LLMs and general reasoning models did not outperform the base instruction-tuned Llama-3.1 8B model in clinical information extraction. Specifically, we found that Llama-3-8B-UltraMedical and PMC-Llama 13B obtained substantially worse performance than Llama-3.1 8B. Leading reasoning LLMs, including deepseek-r1-distill-llama-8b, obtained similar performance to Llama-3.1 8B, despite the use of intermediate reasoning steps. In contrast, our fine-tuned Llama-3.1 significantly exceeded the zero-shot Llama-3.1 model, obtaining human-level performance.

We found that open-source LLMs address the limitations of traditional NLP techniques. Trivedi et al.27 previously evaluated the performance of conventional NLP techniques applied to breast pathology reports and found that performance was lower for cases over 1024 characters with an F-measure of 0.83. In contrast, we saw consistent performance across our datasets, where all reports across all test sets contained over 1024 characters. Yala et al.6 developed statistical machine-learning tools for parsing breast pathology reports; while the tool achieved an exact-match accuracy of 90%, it required over 6000 annotations to reach this performance. In contrast, we demonstrated that we could achieve human-level results leveraging only ≤ 100 training set reports, addressing the primary limitation of this prior work. As a flexible framework for low-code evaluation of new models, Strata complements recent advances in LLM adaptation for low-resource biomedical settings. Given Strata, valuable directions for future work include detailed investigations in targeted prompt tuning, output correction strategies, and alternative learning rate schedules.

Our study has limitations. We used datasets from two tertiary academic institutions, which may not account for variability in pathology report structures, terminology, and style across different institutions. Our clinical information extraction tasks only capture a fraction of the data elements in clinical reports. Additional work is needed to broaden our results to more tasks and institutions.

Conclusion

Our study shows that fine-tuned open-source LLMs can achieve human-level performance while leveraging modest annotation (i.e., ≤ 100 reports) and computational resources (i.e., <$3.00). Self-hosted open-source LLMs offer a compelling and high-performance alternative to commercial LLMs. We release Strata, our lightweight library, to broaden access to these tools.