Introduction

Large language models (LLMs) show promise for diagnosis support, treatment planning, and patient communication1,2,3, but moving from general explanation to patient-specific reasoning is hard. Clinical use demands precise interpretation, consistent specialty terminology, and outputs aligned to standardized taxonomies. This is especially acute in ophthalmology, where laterality, quantitative measures, named exams, and longitudinal follow-up are central.

Small, domain-specific language models (SLMs; used interchangeably here with “compact LLMs” to denote resource-efficient smaller models, whereas LLM denotes the general class of large language models) offer a practical path forward: they run on-premise with lower latency and cost while keeping data on site for privacy and compliance; and, with case-based fine-tuning, they more readily internalize fine-grained knowledge (terminology, laterality, code schemas) and support tighter output governance (guardrails, selective abstention, calibration). In practice, SLMs help bridge linguistic fluency to clinically reliable, code-aligned reasoning and improve workflow fit and deployability.

Ophthalmology exemplifies these demands. Clinicians must reason over domain-specific constructs such as laterality, quantitative measurements, named examinations, and longitudinal follow-up, and they must map conclusions to interoperable code systems used for reporting and downstream workflows. General-purpose LLMs, often trained on broad biomedical or encyclopedic text, may be fluent yet insufficiently grounded for these context-sensitive tasks.

To address these gaps, we introduce Ophtimus-V2-Tx, a compact, domain-specialized ophthalmology model adapted from a modern 8B-parameter foundation model4 via parameter-efficient fine-tuning. Rather than relying on loosely structured narratives, we fine-tune on schema-structured case reports that capture end-to-end clinical workflows (presentation, examinations, diagnostic interpretation, therapeutic decisions, and follow-up), preserving details such as laterality and quantitative values. This case-based approach aims to couple clinical specificity with computational efficiency suitable for on-premise settings.

We evaluate model outputs as clinicians would consume them: free-text rationales are mapped to standardized code systems (ICD-10-CM for diagnoses, ATC for medications, ICD-10-PCS for procedures) and scored hierarchically to reflect both exact matches and clinically proximate near-misses. In parallel, we assess the narrative alignment of the free-text itself using complementary text metrics (e.g., ROUGE-L, BLEU, METEOR) and a semantic-similarity measure. This dual perspective highlights cases where clinical reasoning is consistent even when exact codes differ, and provides a more complete picture of output quality beyond strict code identity. For fairness and reproducibility, the same frozen coding agent and prompt are applied to both reference labels and predictions, with deterministic normalization and clearly specified mapping rules.

In comparative studies, Ophtimus-V2-Tx is competitive with leading general-purpose models on clinically grounded outputs and shows advantages in targeted settings, while improving over prior ophthalmology-tuned baselines on narrative measures and structured reasoning. Taken together, these results point to a practical path for compact, domain-adapted models that combine clinical specificity with deployability.

Contributions

  • A case-report–based fine-tuning strategy that strengthens clinical specificity and preserves ophthalmology-relevant structure (e.g., laterality, measurements, named tests).

  • Ophtimus-V2-Tx, a compact ophthalmology model designed for realistic decision-support use and efficient on-premise operation.

  • A hierarchical, code-aligned evaluation spanning diagnoses, medications, and procedures, complemented by free-text similarity assessment to capture narrative alignment.

  • A reproducible and resource-efficient pipeline, including an auditable mapping procedure, intended to lower the barrier to safe, site-ready deployment and future benchmarking5,6.

Results

Comparative evaluation strategy

To rigorously assess the clinical utility of our proposed model, Ophtimus-V2-Tx, we conduct a multi-dimensional comparative evaluation against the following baseline models (Table 1):

Table 1 Evaluated comparative models.

Three of the models in this comparison (Ophtimus-V1-Inst6, Ophtimus-V2-Inst, and Ophtimus-V2-Tx) are our custom instruction-tuned models. Among them, Ophtimus-V2-Inst and Ophtimus-V2-Tx were both developed from a common pretrained foundation, the Ophtimus-V2-Base7 model, with details of its pretraining corpus provided in the “Methods” section. These two models were then fine-tuned for different tasks. The Ophtimus-V2-Inst model was fine-tuned primarily on multiple-choice question answering (MCQA) and essay-style QA datasets, with a focus on clinical reasoning and medical knowledge comprehension. It retained a strong alignment with structured, educational-style questions commonly seen in medical board exams. In contrast, the Ophtimus-V2-Tx model extended the same base foundation through additional fine-tuning that combined the original instruction datasets with real-world clinical case reports, specifically curated to emphasize treatment planning, therapeutic decision-making, and post-treatment follow-up guidance. By incorporating case-based narratives into the fine-tuning data, Ophtimus-V2-Tx was optimized not only for diagnostic reasoning but also for generating contextually appropriate management plans grounded in ophthalmology practice.

The evaluation focuses on four key dimensions: (1) general performance on multiple-choice questions, (2) clinical accuracy based on text similarity, (3) clinical accuracy using the CliBench framework, and (4) keyword-specific MCQA accuracy. First, we conduct a general performance comparison using a medical multiple-choice question answering (MCQA) dataset that includes diagnostic, treatment, and procedural questions. This baseline comparison helps assess the models’ general reasoning capabilities under structured question formats. Second, we perform a text similarity evaluation to assess the linguistic alignment between model-generated outputs and human-written references. This includes the use of ROUGE-L, BLEU, METEOR, and SEMSCORE metrics8,9,10,11, which measure surface-level and semantic correspondence across generated sentences. Third, we adopt the CliBench framework to perform fine-grained, task-specific evaluation across three clinical dimensions: semantic similarity to ground-truth sentences, diagnostic classification accuracy, and treatment decision accuracy. CliBench is a clinically grounded benchmark that maps model outputs to standardized medical code systems (ICD-10-CM, ATC, ICD-10-PCS) and evaluates performance across multiple hierarchy levels (L1–L4 and Full code match). This enables hierarchical F1 score computation for each clinical task.

  • Diagnostic Accuracy: Model-generated primary and secondary diagnoses are automatically mapped to ICD-10-CM codes and evaluated using hierarchical F1 scores.

  • Treatment Accuracy: Medications are mapped to ATC codes and surgical procedures to ICD-10-PCS codes. Accuracy is again measured by hierarchical F1 scores, highlighting the model’s ability to generate valid therapeutic outputs.

This CliBench-based evaluation is the core of our analysis, as it directly measures the clinical validity of model-generated diagnoses and treatment plans using medically structured standards. It complements surface-level metrics by evaluating whether the outputs are clinically appropriate within a structured reasoning framework.

Finally, we analyze MCQA performance at the keyword level. Clinical topics such as macular degeneration, glaucoma, and diabetic retinopathy are used to cluster questions and compute model-wise accuracy within each topic. This allows us to identify domain-specific strengths and weaknesses and highlights the subfields where Ophtimus-V2-Tx provides the most substantial improvement.
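
For illustration, the topic-wise accuracy computation can be sketched as follows; the table layout and column names (topic, model, correct) are hypothetical placeholders rather than the exact analysis script.

```python
# Hypothetical sketch of the topic-wise MCQA accuracy analysis.
# Assumes a results table with one row per (question, model) pair and
# columns: "topic", "model", "correct" (1 if the model's choice matched the key).
import pandas as pd

results = pd.read_csv("mcqa_results.csv")  # hypothetical file name

# Accuracy per (topic, model), plus the number of questions per topic.
topic_acc = (
    results.groupby(["topic", "model"])["correct"]
    .agg(accuracy="mean", n="size")
    .reset_index()
)
print(topic_acc.sort_values(["topic", "accuracy"], ascending=[True, False]))
```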

Qualitative comparison of clinical outputs across models in case-based scenarios

Table 2 presents a comparative evaluation of diagnostic and treatment suggestions generated by GPT-4o, Ophtimus-V2-Inst, and Ophtimus-V2-Tx for four real-world ophthalmic cases. While GPT-4o and Ophtimus-V2-Inst generally provided partially correct or broad responses, Ophtimus-V2-Tx consistently produced more specific and clinically accurate outputs aligned with expert ground truth. These results underscore the effectiveness of case-based fine-tuning in enhancing complex treatment reasoning capabilities.

Table 2 Comparative diagnosis and treatment suggestions across models for four clinical cases.

Sentence similarity analysis

Table 3 presents the performance comparison across four widely used automatic evaluation metrics for language generation: ROUGE-L, BLEU, METEOR, and SEMSCORE. Each metric captures a distinct aspect of text quality, providing a comprehensive view of model performance.

Table 3 Performance comparison across aggregated generation metrics (mean and 95% CI via cheap subsampling bootstrap). Bold values indicate the best-performing scores within each column.

ROUGE-L measures the longest common subsequence between the generated and reference sentences, focusing on structural similarity and content overlap. BLEU evaluates precision over short n-grams, reflecting how accurately the generated output reproduces the lexical content of the reference. METEOR, in contrast, incorporates stemming and synonym matching to account for semantically equivalent expressions even when exact wording differs. Lastly, SEMSCORE leverages pre-trained sentence embedding models (e.g., SBERT) to compute semantic similarity, emphasizing the overall meaning conveyed rather than exact word usage.

Across aggregated generation metrics with 95% confidence intervals (Table 3), Ophtimus-V2-Tx exceeds OpenAI GPT-4o on ROUGE-L (0.40 [0.40, 0.41] vs. 0.18 [0.17, 0.18]), BLEU (0.26 [0.26, 0.27] vs. 0.05 [0.05, 0.05]), and METEOR (0.45 [0.45, 0.46] vs. 0.29 [0.29, 0.29]); the 95% CIs are non-overlapping, indicating robust improvements in structural and lexical alignment under our bootstrap protocol.

For SEMSCORE, OpenAI GPT-4o attains a slightly higher estimate (0.82 [0.81, 0.82]) than Ophtimus-V2-Tx (0.80 [0.80, 0.81]); because the 95% CIs overlap at 0.81, we characterize semantic similarity as comparable rather than conclusively different.

Taken together, these results suggest that domain-specific pretraining and targeted instruction tuning in Ophtimus-V2-Tx yield substantial gains in structural/lexical fidelity while maintaining high semantic adequacy. In line with best practice, we report effect directions alongside uncertainty quantification rather than blanket superiority claims.

Diagnosis analysis

Table 4 summarizes the hierarchical F1 scores for all CliBench tasks—Primary and Secondary Diagnosis, Medication, and Surgical Procedure—along with their 95% confidence intervals. For each task, the best-performing scores at each hierarchical level (L1 to Full) are highlighted in bold.

Primary diagnosis results

Ophtimus-V2-Tx (8B) attains the best L2 score (0.58) and ties for the best L4 score (0.40). At L3 it is essentially on par with OpenAI API (0.51 vs. 0.52), while at Full it reaches 0.23 versus OpenAI’s 0.26. Overall, Tx shows strength at mid/fine levels (L2/L4) in terms of structural understanding and code concordance; most gaps fall within overlapping 95% CIs, underscoring its competitiveness despite the 8B parameter scale.

Secondary diagnosis results

Ophtimus-V2-Tx achieves the best Full (exact code) score (0.15) and ties for the best L4 (0.17) (cf. OpenAI Full 0.12, L4 0.17). OpenAI leads at L1–L3, but Tx is strongest in fine-grained, code-level alignment. Because clinical workflows such as billing, registry curation, and downstream decision support often require exact-code agreement, this pattern is practically meaningful.

Contributions and implications of Ophtimus-V2-Tx (8B)

In sum, the domain-tuned 8B Ophtimus-V2-Tx competes directly with a large proprietary system and shows consistent advantages at fine-grained levels (L4 and Full), metrics that matter in clinical use. We attribute this to (i) a schema-first (case-structured) input that preserves quantitative values, laterality (OD/OS/OU), test names (OCT/FA/OCTA), and explicit findings, and (ii) source-level evaluation in which reference labels from the original texts are normalized to ICD-10-CM for comparison. Moreover, the parameter efficiency (8B) enables on-premise/edge deployment and lower inference cost, supporting immediate workflows such as diagnosis code suggestion/verification, case-based differential diagnosis candidates, and documentation/registry quality control.

In addition to reporting confidence intervals for each model, we also examined confidence intervals for the pairwise differences in accuracy and F1 scores using bootstrap resampling across all evaluation tasks. This provides a conservative statistical test: if the 95% CI of the difference excludes zero, the performance gap can be considered statistically significant.
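
A minimal sketch of this paired-difference bootstrap is shown below; it assumes aligned per-case score arrays for two models and reuses the subsampling settings reported in the Methods (m = 0.6N, B = 100), though the exact implementation may differ.

```python
# Sketch of a paired bootstrap CI for the difference in a per-case metric
# between two models (e.g., per-case F1 or 0/1 accuracy). Assumes `scores_a`
# and `scores_b` are aligned arrays over the same evaluation cases.
import numpy as np

def paired_diff_ci(scores_a, scores_b, m_frac=0.6, n_boot=100, seed=42):
    rng = np.random.default_rng(seed)
    scores_a, scores_b = np.asarray(scores_a), np.asarray(scores_b)
    n = len(scores_a)
    m = int(m_frac * n)  # cheap subsampling: resample m = 0.6N cases
    diffs = []
    for _ in range(n_boot):
        idx = rng.choice(n, size=m, replace=True)
        diffs.append(scores_a[idx].mean() - scores_b[idx].mean())
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return lo, hi  # if the interval excludes 0, the gap is treated as significant
```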

Table 4 Hierarchical F1 scores for CliBench across all tasks—primary/secondary diagnosis, medication, and surgical procedure—with 95% confidence intervals (CI).

Treatment result analysis

CliBench imposes exact ICD-10-CM/ATC/ICD-10-PCS matching from free text, which is stricter than MCQA and can understate clinically reasonable outputs that miss a specific leaf code. To contextualize absolute scores, we additionally report hierarchical proximity (L1–L4 vs. Full) and text-level semantic similarity (e.g., ROUGE-L, BLEU, METEOR/semantic score). These complementary views decouple clinical alignment from exact-code identity and show that many errors are near-misses rather than clinically irrelevant.

On CliBench, treatments were evaluated as Medication (ATC) and Surgical Procedure (ICD-10-PCS) using hierarchical F1 at L1–L4 and Full (exact code), with 95% CIs reported.

Medication

OpenAI leads across levels, while Ophtimus-V2-Tx (8B) narrows the gap at fine granularity: at Full it reaches 0.31 versus OpenAI’s 0.32 with overlapping CIs, indicating practically comparable exact-code concordance despite the parameter gap.

Surgical procedure

The two models tie at L1 (0.75 each), showing similar recognition of broad procedure classes. OpenAI is stronger at L2–L3, but Full converges (0.13 for Tx vs. 0.16 for OpenAI; overlapping CIs), suggesting proximity at the exact-code level.

Implications

These results show that an 8B, schema-first model can deliver near state-of-the-art exact-code performance for medication and competitive surgical coding at the coarse level, supporting first-pass code suggestion/verification and treatment-option drafting with lower compute cost and on-prem/edge deployability.

Contributions and implications of Ophtimus-V2-Tx (8B)

As a compact, ophthalmology-tuned 8B model, Ophtimus-V2-Tx competes with much larger systems, with its strongest gains at deep taxonomy tiers and exact-code agreement. An evidence-anchored case template coupled with code-normalized evaluation yields auditable, easily integrated outputs and strong accuracy-per-compute. This enables on-prem/edge use for first-pass coding, differential candidate lists, and documentation QA. The shared schema/codebooks extend to treatment tasks and other subspecialties, providing a practical route to scalable, privacy-preserving clinical LLMs.

Benchmark-based performance comparison

Evaluation datasets

To evaluate and compare the performance of Ophtimus-V2-Tx against other medical language models, we employed three datasets, each representing different evaluation dimensions: domain-specific expertise, professional-level QA robustness, and literature-based inference (Table 5).

  • Ophtimus-Eval-V15: This dataset was independently constructed to benchmark ophthalmology-specific MCQA performance. All questions were curated from public educational resources and reviewed by medical experts to ensure quality. Notably, it allows topic-wise performance analysis across 19 ophthalmic subfields. Access is currently restricted to protect clinical validation integrity.

  • MedMCQA (Ophthalmology Subset)12: Extracted from a national medical exam dataset (NEET-PG), this subset contains thousands of ophthalmology-focused questions. It serves to test the transferability of general medical QA models to the ophthalmology domain and includes expert-reviewed explanations for some questions.

  • PubMedQA (Ophthalmology Subset)13: This subset targets biomedical inference by linking questions to PubMed abstracts with Yes/No/Maybe labels. Although limited in size, it evaluates a model’s ability to reason over real-world biomedical literature in ophthalmology.

Table 5 Comparison of evaluation datasets used for ophthalmology-related model benchmarking.

General performance analysis

Table 6 shows the overall classification accuracy of each model across three multiple-choice question answering (MCQA) datasets: Ophtimus-Eval-V1 (ophthalmology-specific), MedMCQA (medical licensing-style questions), and PubMedQA (biomedical research questions).

Table 6 Overall classification accuracy (proportion) with 95% CIs (cheap subsampling bootstrap; \(m=0.6N\), \(B=100\)) on the Ophtimus-Eval-V1, MedMCQA, and PubMedQA datasets.

Compared to other baselines, Ophtimus-V2-Inst achieved strong performance on Ophtimus-Eval-V1 (0.63) and competitive results on MedMCQA (0.71) and PubMedQA (0.72), showing a balanced understanding across domain-specific, clinical, and general biomedical contexts. Interestingly, Ophtimus-V2-Tx exhibited a distinctive pattern: it showed lower accuracy on Ophtimus-Eval-V1 (0.43) and MedMCQA (0.53), likely due to its focus on generative treatment tasks rather than QA fine-tuning; this trade-off is discussed further in a later section.

On PubMedQA, Ophtimus-V2-Tx achieved an accuracy of 0.88, which is competitive with OpenAI GPT-4o (0.91) and higher than all other compared models. Given the relatively small sample size of the ophthalmology subset of PubMedQA (\(N=297\)), these results should be interpreted with caution and warrant further validation; to reflect this uncertainty, we also report 95% confidence intervals and confusion matrices in Fig. 7. In contrast, many general-purpose LLMs such as LLaMA 3-8B Instruct and PMC-LLaMA-13B showed inconsistent or modest performance across the three benchmarks, whereas OpenAI GPT-4o demonstrated the highest overall robustness, followed closely by Ophtimus-V2-Tx on PubMedQA. Overall, these results illustrate the trade-offs between task specialization and QA generalization. The two instruction-tuned variants of Ophtimus-V2 exhibit complementary strengths: Ophtimus-V2-Inst excels in domain-specific and clinically structured MCQA, while Ophtimus-V2-Tx shows remarkable generalization to biomedical inference tasks despite limited MCQA fine-tuning.

Figure 1 shows the token distribution used in fine-tuning Ophtimus-V2-Tx, revealing that 29.1% of its training data was derived from ophthalmology case reports. In contrast, Ophtimus-V2-Inst was trained exclusively on QA-formatted data from PubMed and textbook sources. Despite sharing the same base model, fine-tuning objective, and loss function, Ophtimus-V2-Tx showed a notable drop in MCQA performance (e.g., 0.43 on Ophtimus-Eval-V1), likely due to a reduced proportion of QA-specific signals. The inclusion of narrative-style clinical cases may have diluted the model’s QA alignment, underscoring the trade-off between generalization to treatment reasoning and specialization in structured QA tasks.

Fig. 1 Token distributions over training datasets.

Topic-wise performance analysis

Figure 2 presents a comparative analysis of multiple-choice question answering (MCQA) accuracy across 20 ophthalmic subtopics. The number of questions per topic is indicated in parentheses, providing context for interpreting model performance. For this analysis, we used only the Ophtimus-Eval dataset, since it is the only benchmark among the three (Ophtimus-Eval, MedMCQA, PubMedQA) that provides explicit categorization by ophthalmic keywords (e.g., Uveitis, Glaucoma, AMD). This allowed us to assess performance variation across clinical subdomains. PubMedQA and MedMCQA lack topic-level annotations, and thus were not suitable for topic-wise comparison.

Fig. 2 Topic-wise performance comparison on ophthalmology QA (number of test samples per topic in parentheses).

Overall, OpenAI GPT-4o demonstrated consistently strong performance across most topics, showing particularly high accuracy in general areas such as General Ophthalmology (986), Pharmacology (40), and Conjunctiva (21). However, it did not always achieve the top score, as certain specialized ophthalmic topics were better handled by the Ophtimus series models. Ophtimus-V2-Inst, trained on MCQA and essay-style reasoning questions, achieved competitive or even superior accuracy in domains such as Neuro-Ophthalmology (165), Ocular Trauma (28), and Systemic Diseases (8). Especially in Neuro-Ophthalmology, which requires reasoning across neurology and ophthalmology, the model showed strong task-specific inference capabilities. Interestingly, although Ophtimus-V2-Tx was fine-tuned for treatment planning and follow-up generation, it exhibited high accuracy in clinically reasoning-intensive domains such as Uveitis (40), Retina and Vitreous (13), and Oculoplastic (11).

We note, however, that this topic-wise result is presented as an exploratory observation rather than a definitive benchmark of complexity. Given the small sample size (N = 40) and the wide confidence interval, these findings should be interpreted with caution, serving only to illustrate potential trade-offs introduced by case-report fine-tuning.

Uveitis is a complex inflammatory condition with diverse etiologies that demands high-level clinical reasoning and differential diagnosis skills14. This suggests that the case-based learning approach used to train the Tx model effectively enhanced its domain understanding and therapeutic decision-making capabilities. In summary, GPT-4o achieved the highest overall average performance, but Ophtimus-V2-Inst proved to be a competitive domain-specialized model for medical QA, and Ophtimus-V2-Tx demonstrated strengths in clinically complex topics that require fine-grained therapeutic reasoning. These results highlight the complementary roles of general-purpose and task-specific medical language models and offer valuable guidance for model selection and design in clinical applications (Table 7).

Table 7 Topic-wise accuracy comparison with training sample distribution. Bold values indicate the best-performing accuracy within each topic between the two models. 

Methods

Ophthalmic foundation model: Ophtimus-V2-base

Figure 3 illustrates the overall development pipeline for our ophthalmology-specialized language model. The process consists of four main stages: Source Data Gathering, Text Pre-processing and Pre-training, Instruction-based Fine-Tuning, and Evaluation.

Fig. 3 Training and evaluation workflow for Ophtimus-V2 language models.

In the source data collection stage, we restricted retrieval to the PubMed Central (PMC) Open Access subset and the English language, applying article-type filters to include case reports and closely related clinical report categories. Using ophthalmology-specific Boolean/MeSH queries—25 for the Base pretraining corpus and 29 for the case-report corpus (exact query lists, field tags, and access dates in Table 9 and the Supplement)—we retrieved candidate records. We then performed document-level deduplication using PMCID, removed near-duplicates, and conservatively removed any personally identifiable information (PII). Aggregating across sources yielded approximately 18.4M tokens from PMC and about 4.0M tokens from domain-specific textbooks (e.g.,15,16,17,18,19,20,21). Across the textbook subset used for pretraining, we count 18 volumes and 7962 pages in total, corresponding to an average of 442.3 pages per volume. Together these sources form the final Ophthalmology Corpus dataset.
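
The retrieval and deduplication step can be sketched as follows with Biopython’s Entrez utilities; the query shown is a simplified placeholder for the actual Boolean/MeSH queries listed in Table 9 and the Supplement, and the email address is a placeholder.

```python
# Illustrative sketch of PMC Open Access retrieval and PMCID-level deduplication.
# The query below is a simplified stand-in, not one of the actual 25/29 queries.
from Bio import Entrez

Entrez.email = "researcher@example.org"  # required by NCBI; placeholder address

query = (
    '"ophthalmology"[MeSH Terms] AND "case reports"[Publication Type] '
    'AND "open access"[Filter] AND english[Language]'
)
handle = Entrez.esearch(db="pmc", term=query, retmax=10000)
record = Entrez.read(handle)  # paged retrieval / history server needed for full result sets
handle.close()

pmc_ids = set(record["IdList"])  # document-level deduplication by PMCID
print(f"{len(pmc_ids)} unique PMC records retrieved")
```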

This curated corpus was then used to pre-train a foundation model based on LLaMA 3.1, resulting in the creation of the Ophtimus-V2-Base model. This base model is optimized for general understanding and generation of ophthalmic text, such as clinical descriptions, diagnostic reasoning, and treatment explanations.

In the next stage, we performed instruction tuning to train the model on medical question-answering tasks. We constructed a domain-specific QA dataset using the OpenAI API, focusing on both multiple-choice and essay-style formats that reflect real-world clinical reasoning. Through supervised fine-tuning with LoRA techniques22, we derived the Ophtimus-V2-Inst model, which is tailored to handle complex medical queries and reasoning tasks effectively and efficiently.

Finally, we conducted a comprehensive evaluation using both external benchmarks and our own Ophtimus-Eval-V1 dataset, a carefully curated multiple-choice QA (MCQA) evaluation set based on ophthalmic case scenarios collected from Academia.edu. Ophtimus-Eval-V1 was designed to assess overall accuracy and keyword-specific performance. This allowed us to identify domain-specific strengths, such as performance variations across key ophthalmic topics like macular degeneration, glaucoma, and diabetic retinopathy.

Data Sources and Usage Policy

Domain textbooks (e.g., BCSC, Kanski) used in this work are copyrighted. We employed them exclusively for non-consumptive, research-only text mining to derive structured fields and aggregate statistics. The pretraining textbook subset comprises 18 volumes totaling 7962 pages (mean 442.3 pages per volume). No copyrighted text or images are redistributed; the released artifacts consist only of model weights, code, prompts, schema definitions, and aggregate metrics. For reproducibility, we provide an open-access–only retraining recipe and report ablations demonstrating similar performance trends without copyrighted sources.

Ophthalmic case report dataset construction

Figure 4 illustrates the overall process used to construct the ophthalmology case report dataset for fine-tuning the Ophtimus-V2-Tx model. The process consists of three main stages: Data Crawling, Data Filtering, and Formatting Data for Fine-tuning.

Fig. 4 Dataset construction for Ophtimus-V2-Tx.

  1. Data crawling

    We first collected documents from two major sources: PubMed (PMC Open Access only, English, no date limit) and an ophthalmology case-report textbook. Using ophthalmology-specific keywords/MeSH (25 for Base; 29 for Case Report), we retrieved 10,377 case-report candidates from PubMed (initial retrieval), followed by screening and deduplication. After PMCID-based document-level deduplication and eligibility screening, 8261 articles remained; from these, we structured 10,815 individual patient cases. In parallel, we extracted and grouped case narratives from Clinical Cases in Medical Retina (1st ed.; 74 chapters). This source is copyrighted; materials were used strictly under the publisher’s terms for research-only text mining, and no original text or derivatives were redistributed. To prevent data leakage, textbook-derived content was used only for pretraining/fine-tuning and was excluded from evaluation.

  2. Data filtering

    We applied multi-stage filtering with the OpenAI API to both PubMed papers and textbook-based documents.

    Before invoking the API, we applied rule-based gates (document type/language/length/keywords) with deduplication and near-duplicate removal. We then used a frozen OpenAI API strictly as a text classifier (no generation), with prompts limited to document-type decisions-we did not solicit subjective “paper relevance.” Prompt templates are provided verbatim in Appendix B.2.

    In the first step, the classifier operated on PubMed papers only, checking ophthalmic domain membership, language, and the presence of experimental/control groups.

    In the second step, both PubMed papers and textbook documents underwent privacy/irrelevance screening: residual identifiers (names/initials, institution names, contact lines, exact visit dates, addresses, image identifiers) were removed or normalized, while clinical signals (quantitative values, laterality tokens, test names, explicit findings) were preserved; the PHI policy and examples appear in Appendix B.1 and B.2.

    Through this two-phase filtering, we selected a final set of 8261 PubMed documents and 74 textbook chapters as usable ophthalmology case-report documents for fine-tuning.

  3. Formatting data for fine-tuning

    The filtered documents were converted to a schema-first, programmatic format-not free-form summarization. A rule-based mapper populated a fixed template with two blocks (Patient Information; Diagnosis and Treatment). The OpenAI API was used only as a frozen fallback field router to disambiguate spans (e.g., findings vs. rationale), never to paraphrase or condense; ambiguous cases were flagged for spot-check. The routing prompt is given in Appendix B.3, and the schema in Table 8.

Table 8 Structured format for case reports.
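
For illustration, a minimal version of this two-block template might look as follows; the field names are hypothetical placeholders and do not reproduce the exact schema of Table 8.

```python
# Hypothetical schema sketch for a structured case record with the two blocks
# described above (Patient Information; Diagnosis and Treatment). Field names
# are illustrative placeholders, not the exact Table 8 schema.
CASE_TEMPLATE = {
    "patient_information": {
        "age": None,
        "sex": None,
        "laterality": None,          # OD / OS / OU, preserved verbatim
        "presenting_symptoms": [],
        "examinations": [],          # named tests (e.g., OCT, FA, OCTA) with values
        "findings": [],              # explicit findings with quantitative values
    },
    "diagnosis_and_treatment": {
        "primary_diagnosis": None,
        "secondary_diagnoses": [],
        "medications": [],
        "surgical_procedures": [],
        "follow_up": None,
    },
}
```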

We utilized a total of 10,815 cases (one document may contain multiple patient cases) for training and evaluation. Among these, 9193 cases were assigned to the training set and 1622 cases to the evaluation set through random splitting. The 76 cases collected from the textbook were included only in the training data. While a single case-report document may contain multiple patient cases, meaning that cases from the same document could be assigned separately to the training and evaluation sets, this does not pose a problem, as each case represents an independent patient.

Table 9 Retrieval statistics of case report-relevant ophthalmology papers by keyword (N = 8261).

Fine-tuning strategy

Figure 5 illustrates our fine-tuning strategy for the ophthalmology-specific adaptation of the Ophtimus-V2-Base model. To develop Ophtimus-V2-Tx, a domain-specialized language model tailored for ophthalmology, we began with the pretrained foundation model, Ophtimus-V2-Base (8B parameters), and further refined it using a combination of QA-based datasets and real-world clinical case reports.

Fig. 5 Fine-tuning for Ophtimus-V2-Tx.

To enhance the model’s clinical reasoning capabilities in real-world ophthalmic contexts, we incorporated 9193 structured case reports covering a wide range of ophthalmologic conditions, carefully formatted to capture end-to-end diagnostic workflows (patient presentation, symptom progression, relevant examinations, diagnostic interpretation, treatment decisions, and follow-up considerations). From the final pool of 10,891 structured cases, the 9193 PubMed-derived training cases and the 76 textbook-derived cases were used for fine-tuning, with the remaining 1622 cases held out for evaluation.

For efficient domain adaptation, we employed Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method. LoRA introduces lightweight trainable matrices into each transformer layer, enabling specialization to new domains without modifying the entire set of pretrained parameters. This strategy offers computational efficiency while retaining the base model’s general medical knowledge. The integration of general-domain QA data with LoRA-based fine-tuning on case report data resulted in the final model, Ophtimus-V2-Tx. Trained on a total of 95,157 QA and case examples, this model combines broad medical understanding with precise clinical reasoning abilities for ophthalmology-specific tasks.

Training setup

We fine-tuned Ophtimus-V2-Tx from Ophtimus-V2-Base using parameter-efficient Low-Rank Adaptation (LoRA) without applying quantization. The configuration was as follows:

  • LoRA: \(r=32\), \(\alpha =16\), dropout \(=0.05\); target_modules = {q_proj, v_proj, k_proj, o_proj, gate_proj, up_proj, down_proj}.

  • Optimizer and LR schedule: AdamW optimizer; learning rate = \(2{\times }10^{-4}\); weight decay = 0.01; warmup steps = 10; constant-with-warmup scheduler.

  • Batching: per-device train batch size = 8; gradient accumulation = 8; effective batch size across 3 GPUs = 192.

  • Training duration and hardware: 25 epochs on 3×NVIDIA A100 80GB GPUs; total training time of approximately 3 days.

  • Sequence length: maximum context length = 4096 tokens.

  • Evaluation/save: eval strategy = steps (every 25 steps), save strategy = epoch, logging steps = 1, seed = 42.
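
A minimal configuration sketch using the Hugging Face peft and transformers libraries is given below; it mirrors the hyperparameters listed above but is not the exact training script, and argument names may differ slightly across library versions. The model identifier is a placeholder.

```python
# Sketch of the LoRA fine-tuning configuration described above (not the exact script).
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

base = AutoModelForCausalLM.from_pretrained("Ophtimus-V2-Base")  # placeholder model id

lora_cfg = LoraConfig(
    r=32, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(base, lora_cfg)

args = TrainingArguments(
    output_dir="ophtimus-v2-tx",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=8,      # effective batch = 8 * 8 * 3 GPUs = 192
    learning_rate=2e-4,
    weight_decay=0.01,
    warmup_steps=10,
    lr_scheduler_type="constant_with_warmup",
    num_train_epochs=25,
    eval_strategy="steps",              # "evaluation_strategy" in older transformers
    eval_steps=25,
    save_strategy="epoch",
    logging_steps=1,
    seed=42,
)
```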

Inference settings

Unless otherwise noted, decoding used a maximum context of 4096 tokens and generation length up to 512 tokens, with temperature 0.7 and top-p 0.9. Greedy decoding and beam search were not used. We fix these parameters across evaluations to ensure comparability.
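
For reference, these settings correspond roughly to the following generation call; this is a sketch with placeholder model identifiers, not the exact inference code.

```python
# Sketch of the decoding settings: nucleus sampling (temperature 0.7, top-p 0.9),
# up to 512 new tokens, no beam search. Model/tokenizer ids are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Ophtimus-V2-Tx")        # placeholder id
model = AutoModelForCausalLM.from_pretrained("Ophtimus-V2-Tx")

prompt = "Structured case context goes here ..."
inputs = tok(prompt, return_tensors="pt", truncation=True, max_length=4096)
outputs = model.generate(
    **inputs, max_new_tokens=512, do_sample=True,
    temperature=0.7, top_p=0.9, num_beams=1,
)
print(tok.decode(outputs[0], skip_special_tokens=True))
```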

Scoring criteria

To quantitatively assess the quality of diagnosis and treatment sentence generation by Ophtimus-V2-Tx and baseline models, we employed four representative natural language generation (NLG) evaluation metrics: ROUGE-L, BLEU, METEOR, and SEMSCORE. These metrics jointly evaluate not only lexical overlap but also sentence structure, syntactic coherence, and semantic similarity-an essential consideration in clinical settings where the accurate and consistent conveyance of information directly influences patient care.

Confidence intervals. For all key metrics (accuracy, F1, precision, recall), we report 95% confidence intervals estimated via a cheap subsampling bootstrap (m = 0.6N, B = 100 resamples)23.
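
A minimal sketch of this subsampling bootstrap, assuming an array of per-case metric values, is given below; the exact implementation may differ.

```python
# Sketch of the cheap subsampling bootstrap (m = 0.6N, B = 100) used for the
# 95% CIs reported with each metric. `per_case_scores` holds per-case values
# (e.g., per-sentence ROUGE-L, or 0/1 correctness for accuracy).
import numpy as np

def subsample_ci(per_case_scores, m_frac=0.6, n_boot=100, seed=42):
    rng = np.random.default_rng(seed)
    x = np.asarray(per_case_scores)
    m = int(m_frac * len(x))
    means = [x[rng.choice(len(x), size=m, replace=True)].mean() for _ in range(n_boot)]
    return x.mean(), np.percentile(means, 2.5), np.percentile(means, 97.5)
```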

  • ROUGE-L (Longest Common Subsequence). ROUGE-L evaluates structural similarity between the generated sentence C and the reference sentence R based on the length of their longest common subsequence (LCS). It reflects both the precision (how much of C matches R) and the recall (how much of R is preserved in C), using an F-score formulation:

    $$\text {ROUGE-L} = \frac{(1 + \beta ^2) \cdot \text {LCS}\_\text {precision} \cdot \text {LCS}\_\text {recall}}{\text {LCS}\_\text {recall} + \beta ^2 \cdot \text {LCS}\_\text {precision}}$$

    where:

    $$\text {LCS}\_\text {precision} = \frac{\text {LCS}(C, R)}{|C|}, \quad \text {LCS}\_\text {recall} = \frac{\text {LCS}(C, R)}{|R|}$$

    \(\beta\) is typically set to 1.2 to favor recall.

  • BLEU (Bilingual Evaluation Understudy). BLEU measures the precision of n-gram matches between the candidate and the reference sentences, with a brevity penalty to discourage overly short outputs. The overall score is computed as:

    $$\text {BLEU} = \text {BP} \cdot \exp \left( \sum _{n=1}^N w_n \cdot \log p_n \right)$$

    where \(p_n\) is the modified n-gram precision, \(w_n\) is the weight for each n-gram (commonly \(w_n = \frac{1}{N}\)), and BP is the brevity penalty:

    $$\text {BP} = {\left\{ \begin{array}{ll} 1, & \text {if } c > r \\ \exp \left( 1 - \frac{r}{c}\right) , & \text {if } c \le r \end{array}\right. }$$

    with c and r representing the lengths of the candidate and reference sentences, respectively.

  • METEOR (Metric for Evaluation of Translation with Explicit ORdering). METEOR addresses BLEU’s limitations by incorporating both unigram precision and recall, while also accounting for stemming, synonym matching, and word order alignment. The metric computes a harmonic mean (\(F_{mean}\)) and applies a penalty based on fragmentation:

    $$\text {METEOR} = F_{mean} \cdot (1 - Penalty)$$

    where:

    $$F_{mean} = \frac{10 \cdot P \cdot R}{R + 9P}, \quad Penalty = \gamma \cdot \left( \frac{ch}{m} \right) ^\theta$$

    Here, P and R are unigram precision and recall, ch is the number of matched chunks, m is the number of matches, and \(\gamma\), \(\theta\) are penalty parameters (e.g., \(\gamma = 0.5\), \(\theta = 3\)).

  • SEMSCORE (Sentence Embedding Similarity Score). SEMSCORE evaluates semantic similarity by computing the cosine similarity between dense sentence embeddings, such as those produced by Sentence-BERT:

    $$\text {SEMSCORE}(C, R) = \cos (\textbf{e}_C, \textbf{e}_R) = \frac{\textbf{e}_C \cdot \textbf{e}_R}{\Vert \textbf{e}_C\Vert \cdot \Vert \textbf{e}_R\Vert }$$

    where \(\textbf{e}_C\) and \(\textbf{e}_R\) denote the embedding vectors of the candidate and reference sentences. This allows the metric to capture meaning equivalence even in the absence of direct lexical overlap.
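
These metrics can be computed with standard open-source implementations; the sketch below uses rouge-score, NLTK, and Sentence-Transformers, with the embedding model named only as a plausible choice rather than the exact configuration used here.

```python
# Sketch of metric computation with common open-source libraries (not necessarily
# the exact packages/versions used in this work).
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score   # requires nltk wordnet data
from sentence_transformers import SentenceTransformer, util

reference = "Intravitreal anti-VEGF injection was administered in the right eye."
candidate = "The right eye was treated with an intravitreal anti-VEGF injection."

rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(reference, candidate)["rougeL"].fmeasure
bleu = sentence_bleu([reference.split()], candidate.split(),
                     smoothing_function=SmoothingFunction().method1)
meteor = meteor_score([reference.split()], candidate.split())

embedder = SentenceTransformer("all-mpnet-base-v2")  # assumed embedding model
sem = util.cos_sim(embedder.encode(candidate), embedder.encode(reference)).item()
print(rouge_l, bleu, meteor, sem)
```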

Evaluation framework: CliBench

In this study, we adopted a structured evaluation framework based on the CliBench benchmark to quantitatively assess the clinical quality of diagnostic and treatment sentences generated by large language models24. CliBench goes beyond simple question–answering tasks by replicating the complex decision-making processes required in real-world clinical settings. It encompasses a diverse range of medical tasks, such as diagnosis, procedures, laboratory testing, and medication prescriptions, each of which is mapped to standardized medical code systems (ICD-10-CM, ICD-10-PCS, LOINC, ATC). This enables model-generated outputs to be evaluated in terms of their clinical fidelity at the level of medical coding granularity.

A core feature of CliBench is its use of hierarchical taxonomies within medical code systems. Diagnostic outputs are evaluated across progressively finer-grained levels of specificity, from broad disease categories (L1), to subgroups (L2–L4), and finally to full-code matches. Similarly, medication and surgical procedure outputs are scored based on the structured hierarchies of the ATC and ICD-10-PCS systems, respectively. This multi-level evaluation scheme enables fine-grained assessment of both the model’s clinical reasoning capabilities and its ability to generate terminology that aligns with standardized classification systems.

Table 10 presents an example of hierarchical medical coding applied to a clinical case involving diagnosis, medication, and surgical procedure. Each entry is mapped to standardized medical codes—ICD-10-CM for diagnosis, ATC for medication, and ICD-10-PCS for surgical procedure. The table also illustrates the hierarchical structure of each code across four levels (L1–L4), enabling fine-grained clinical evaluation.

Table 10 Example of hierarchical medical coding for diagnosis, medication, and surgical procedure.

As an illustrative example, the diagnosis Ophthalmomyiasis externa—a parasitic infestation of the external eye by fly larvae—is mapped to the ICD-10-CM code B87.81. This code belongs to a hierarchical classification system that allows for multi-level evaluation, as shown below:

  • L1-B: Infectious diseases

  • L2-87: Myiasis (infestation by fly larvae)

  • L3-8: Other myiasis

  • L4-1: External ocular involvement

This layered structure enables hierarchical evaluation of model-generated diagnoses. Even when the model fails to produce an exact full-code match, partial matches at higher levels (e.g., L1 or L2) still indicate meaningful clinical relevance. Such stratified evaluation is crucial in measuring not only exact correctness but also partial reasoning alignment.
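
For illustration, the hierarchical levels of an ICD-10-CM code can be approximated by prefix truncation after deterministic normalization, as in the sketch below (cumulative prefixes of the segments listed above; the actual CliBench taxonomy handling, especially for ATC and ICD-10-PCS, may differ).

```python
# Simplified sketch: deriving hierarchical levels of an ICD-10-CM code by
# prefix truncation after deterministic normalization (upper-case, dot removal).
# Real taxonomies (especially ATC and ICD-10-PCS) use their own level widths.
def icd10cm_levels(code: str) -> dict:
    c = code.upper().replace(".", "").replace(" ", "")
    return {
        "L1": c[:1],    # e.g., "B"      (chapter-level letter)
        "L2": c[:3],    # e.g., "B87"    (category)
        "L3": c[:4],    # e.g., "B878"
        "L4": c[:5],    # e.g., "B8781"
        "Full": c,
    }

print(icd10cm_levels("B87.81"))
```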

To apply the CliBench framework to the ophthalmology domain, we constructed an evaluation dataset based on 1622 ophthalmic case reports that were not used during model training. For each case, the Ophtimus-V2-Tx model was prompted to generate four types of responses: primary diagnosis, secondary diagnosis, medication, and surgical procedure. The generated sentences were automatically mapped to corresponding standardized medical codes using the Perplexity coding agent: ICD-10-CM for diagnoses, ATC for medications, and ICD-10-PCS for surgical procedures.

Fig. 6 CliBench-based Ophtimus-V2-Tx evaluation.

Ground-truth and evaluation (code-to-code)

From the curated set (10,891 cases = 10,815 PubMed + 76 textbook), we fine-tune on 9269 cases (9193 PubMed + 76 textbook) and hold out 1622 PubMed cases for evaluation. We extract atomic Diagnosis/Treatment sentences from the held-out cases with PHI masked and obtain reference labels by mapping them via a frozen coding agent to ICD-10-CM, ATC, and ICD-10-PCS. No clinician adjudication or generative rewriting is used; ambiguous mappings are spot-checked. For assessment, the same structured context is given to each LLM; model outputs are recoded by the same agent with an identical prompt to produce predicted codes. We compute precision, recall, and F1 at the code level with hierarchical scoring (L1–L4, Full), optimal matching for multi-code sets, and penalties for omissions/overcoding. This code-to-code design quantifies clinical fidelity, improves fairness with a single coder/prompt, and preserves privacy via PHI masking and standardized mapping (see Fig. 6).
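
A simplified sketch of this code-to-code scoring is given below; it assumes the level-truncation helper sketched above and uses plain set overlap, whereas the actual protocol additionally applies optimal matching for multi-code sets and penalties for omissions and overcoding.

```python
# Simplified hierarchical precision/recall/F1 over sets of codes at one level.
# Uses the icd10cm_levels() helper sketched earlier; plain set overlap only.
def hierarchical_prf(pred_codes, ref_codes, level="L2"):
    pred = {icd10cm_levels(c)[level] for c in pred_codes}
    ref = {icd10cm_levels(c)[level] for c in ref_codes}
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Illustrative near-miss: the prediction matches at the category level (L2)
# but not at the exact leaf code (Full).
print(hierarchical_prf(["B87.2"], ["B87.81"], level="L2"))    # (1.0, 1.0, 1.0)
print(hierarchical_prf(["B87.2"], ["B87.81"], level="Full"))  # (0.0, 0.0, 0.0)
```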

Code Mapper analysis

Using code datasets (ATC: \(n=1000\); ICD-10-CM: \(n=198\); ICD-10-PCS: \(n=802\)) from Refs. 25 and 26, we compare four mappers (OpenAI, Claude, Gemini, and Perplexity) across the three systems. We report a composite Bias score (lower is better) built from abstention rate, error distance, close-miss rate, and system-specific under-specificity; accuracy is deliberately excluded (Table 18). The meta ranking is OpenAI (0.372) < Gemini (0.446) < Claude (0.479) < Perplexity (0.615) (Table 19). By system: for ATC, Perplexity attains the lowest bias but shows the largest error distance when wrong, while OpenAI exhibits higher abstention and close-miss rates; for ICD-10-CM, OpenAI has the lowest bias (very low abstention, high specificity), while Perplexity is penalized by high abstention; for ICD-10-PCS, Gemini achieves the lowest bias, Claude abstains least but has a higher close-miss rate, and Perplexity ranks worst due to abstention despite relatively low error distance. We did not use GPT-4o (or any LLM) to map or adjudicate predictions; each mapper’s raw code output was evaluated via deterministic normalization (upper-casing, dot/space removal), with ‘X’ treated as abstention. Prompts and post-processing were identical across mappers.

Because our operational goal is fine-grained correctness, L4-level accuracy favors Perplexity across systems (Table 20). We therefore adopt Perplexity as the default mapper and plan rule-based guardrails (triggers: abstention “X”, ATC “Other” L3/L4 = ‘X’, CM .9 unspecified, PCS Device/Qualifier = ‘Z’) with cross-checks using Gemini/Claude/OpenAI when triggered. See Appendix C.1 for details.
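
The deterministic normalization and the planned guardrail triggers can be sketched as follows; the trigger patterns are simplified approximations of the rules described above, and the exact production rules may differ.

```python
# Sketch of deterministic code normalization and rule-based guardrail triggers
# (abstention 'X', ATC "Other" L3/L4 = 'X', ICD-10-CM ".9" unspecified codes,
# ICD-10-PCS Device/Qualifier = 'Z'). Simplified approximations only.
def normalize_code(raw: str) -> str:
    # upper-case, remove dots/spaces; a bare "X" is treated as abstention
    return raw.upper().replace(".", "").replace(" ", "")

def needs_cross_check(system: str, raw: str) -> bool:
    code = normalize_code(raw)
    if code == "X":                                                # abstention
        return True
    if system == "ATC" and len(code) >= 5 and "X" in code[3:5]:
        return True                                                # "Other"-style L3/L4
    if system == "ICD-10-CM" and code.endswith("9"):
        return True                                                # ".9" unspecified (approx.)
    if system == "ICD-10-PCS" and len(code) == 7 and "Z" in (code[5], code[6]):
        return True                                                # Device/Qualifier = 'Z'
    return False
```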

Inter-mapper agreement (Fleiss’/Cohen’s \(\kappa\)) was computed as an exploratory diagnostic and not used for scoring, ranking, or model selection (see Appendix C.1): Inter-mapper agreement is near-perfect for ATC at L2 (Fleiss’ \(\kappa =0.912\)) and moderate for ICD-10-CM/PCS (\(\kappa \approx 0.55\)) despite large label spaces, indicating consensus beyond chance. Pairwise \(\kappa\) shows the strongest concordance along the Claude–Perplexity axis (CM/PCS \(\kappa \approx 0.74\)), with Gemini–Perplexity and Claude–Gemini also strong; OpenAI agrees less in CM/PCS. See Appendix C.1 for full pairwise/Fleiss’ \(\kappa\) tables, methods, and caveats.

Discussion

Impact of case-based fine-tuning

As evidenced by the performance of Ophtimus-V2-Tx in Tables 3 and 4, case-based fine-tuning improves the clinical reasoning capacity of domain-specific language models on selected tasks under our evaluated setting by aligning outputs with real-world medical workflows. This approach outperformed GPT-4o and instruction-tuned baselines on specific structured tasks (e.g., surgical procedure generation and full-code diagnosis classification), with modest absolute gains in several categories. Performance was also comparable on the medication task. Improvements in sentence similarity and higher full-code accuracy on selected tasks, despite the comparatively small model size, suggest that fine-tuning on authentic clinical case reports can capture nuanced decision-making often missed by QA-only training. However, specialization introduces trade-offs: Ophtimus-V2-Tx showed reduced performance on MCQA-style benchmarks, indicating a potential dilution of structured QA capabilities, a limitation we attribute to domain specialization and model capacity rather than universal superiority, as discussed below.

Previous ophthalmology-specific LLMs (Ophtha-LLaMA2, EyeGPT, and LEME) show early progress but remain limited in scope. Ophtha-LLaMA2 focuses on MCQA-style benchmarks and demonstrates strong exam performance yet does not address real-world case-based reasoning. EyeGPT extends GPT architectures to ophthalmology QA tasks with broad coverage but lacks validation for treatment planning or structured code-based evaluation. LEME, an open-source model, provides domain alignment but omits standardized diagnostic/therapeutic coding. In contrast, Ophtimus-V2-Tx is, to our knowledge, the first ophthalmology-specific model fine-tuned on over 10,000 structured case reports and evaluated with the CliBench framework, helping bridge educational QA-style benchmarks and real-world clinical reasoning with code-aligned outputs.

Strengths and limitations of compact LLMs

Ophtimus-V2-Tx, built on a LLaMA-3.1 8B base model, achieves competitive results on our benchmarks when carefully fine-tuned for the target domain. This has implications for computational efficiency and deployment in resource-constrained, privacy-critical settings (e.g., hospitals). Despite its smaller size, Ophtimus-V2-Tx matched, and in some cases exceeded, GPT-4o on selected structured tasks (medication/procedure full-code predictions); differences should be interpreted cautiously where absolute gains are modest. LoRA-based tuning enabled efficient adaptation without full model retraining.

That said, compact design imposes limitations in general-purpose reasoning, transfer, and language coverage. Compared with GPT-4o, Ophtimus-V2-Tx underperformed on general QA datasets such as MedMCQA and exhibited topic-specific variance across ophthalmic subfields (cf. Fig. 2). These patterns likely reflect distributional shifts in training tokens (Fig. 1) and limited capacity of smaller models. Overall, compact models are well-suited for targeted applications, but dataset design and training strategy should balance domain specialization with broader reasoning. A practical route is ensemble- or agent-based designs wherein several compact, sub-domain-specialized models collaborate to capture both specificity and generalization.

Data validation limitations

Although our preprocessing pipeline employed a fixed schema to preserve clinical details verbatim, we recognize that complete validation of textual fidelity across the full corpus remains a limitation of this study. The OpenAI API used in routing was strictly operated as a frozen, non-generative classifier intended only to disambiguate schema boundaries (e.g., findings vs. rationale) without rephrasing. Nevertheless, because this process still depends on LLM-based routing, there remains a possibility of minimal abstraction or paraphrasing that was not exhaustively ruled out. To partially assess fidelity, a subset of structured case reports underwent qualitative review by ophthalmologists. However, due to the scale of data, a systematic corpus-wide audit could not be performed, and we consider this an inherent methodological limitation of the present work.

Limitations

Our evaluation uses “ground truth” labels produced by a single frozen code-mapping agent (Perplexity), applied consistently to both references and predictions; we use a fixed prompt template and deterministic decoding to minimize variance. Accordingly, all reported scores quantify alignment with a consistent automated coding policy rather than clinician-adjudicated correctness. While this design improves reproducibility, reliance on a single policy introduces mapping bias, and-despite contamination precautions-data leakage from large public corpora cannot be fully excluded. Qualitative review indicates that some apparent errors stem from structural limitations of coding taxonomies (e.g., overlapping ICD-10 hierarchies; divergent ATC leaf mappings for semantically equivalent therapies), so hierarchical-F1 and sentence-similarity metrics may understate clinically acceptable near-misses. In addition, absolute gains are modest in several categories, and generalization remains limited beyond the evaluated ophthalmic tasks and datasets. We therefore interpret low full-code accuracies as the combined effect of model errors and taxonomy constraints, and we frame this work as a reproducible policy-alignment benchmark, not as evidence of clinical accuracy or deployment readiness. Appendix C reports sensitivity analyses with independent mappers and partial human coding. Future work will include multi-rater clinician adjudication on stratified samples (blinded review, consensus procedures, inter-rater agreement such as \(\kappa\)), prospective small-scale evaluations on external notes, and policy-sensitivity analyses across alternative coders and prompting policies.

CliBench utility and generalizability

Applying CliBench to four core tasks-primary/secondary diagnosis, medication, and surgical procedure prediction-revealed fine-grained strengths (e.g., higher surgical procedure accuracy and improved full-code matching) while also exposing areas for improvement. Its hierarchical structure (L1–L4, Full) supports nuanced assessment beyond surface-level correctness and better reflects clinical reasoning fidelity. CliBench also supported comparisons across model sizes and training paradigms, illustrating how Ophtimus-V2-Tx generalizes reasonably on inference-heavy tasks such as PubMedQA despite being tuned for treatment planning; however, these results should be viewed as exploratory rather than uniformly superior. To control mapping variance, we used a single frozen coding agent with an identical prompt for both references and predictions, ensuring a consistent policy across systems. We do not rely on GPT-4o for mapping. As a limitation, we did not include a clinician coding baseline in this submission; Appendix C reports sensitivity analyses (independent mapper and partial human coding), and future work will add multi-rater clinician annotations with inter-rater agreement (e.g., \(\kappa\)) and near-code error typology.

Deployment considerations and intended use

For deployment, the model is intended for decision support rather than autonomous coding. We recommend precision-first operating points with selective abstention, mandatory human review for low-confidence or high-impact codes, and site-specific calibration (reliability assessment/temperature scaling) prior to go-live. A small pre-deployment audit and post-deployment drift monitoring provide additional safeguards for safe integration into coding workflows.

Related works

Expansion of general-purpose and medical-specific LLMs

The recent emergence of various general-purpose Large Language Models (LLMs) has led to remarkable advancements in the field of natural language processing (NLP), demonstrating impressive performance across key tasks such as question answering, summarization, and code generation27,28,29,30. Building on these successes, the application of general-purpose LLMs has rapidly expanded into the medical domain, leading to growing interest in clinical language processing31.

In parallel, a variety of medical-specific LLMs have been developed to address the unique demands of clinical settings. These models have shown significant performance in medical question answering, chest X-ray report generation, and standardized medical examinations such as the USMLE32,33,34,35,36,37,38,39. A comprehensive review of the development trends of medical LLMs can be found in Ref. 40.

While these general-purpose and medical-specific LLMs have demonstrated remarkable capabilities in broad medical NLP tasks, they often lack specialization in underrepresented yet clinically important domains such as ophthalmology. In contrast, our work focuses on the development of a lightweight, domain-specific model tailored for ophthalmology, addressing the gap in clinical reasoning performance for this specialty.

Ophthalmology-specific LLMs

In highly specialized medical fields such as ophthalmology, general-purpose LLMs often struggle to support domain-specific diagnostic reasoning or clinical decision making. In fact, degraded performance has been observed on ophthalmology-centered evaluations such as the OKAP, USMLE, and ophthalmology board exams41,42,43. To address these limitations, specialized models such as EyecareGPT44, LEME45, and RETFound46 have been developed for tasks including diagnostic generation, image interpretation, and analysis of long-form clinical case reports.

Unlike previous ophthalmology models that focus primarily on image interpretation or long-form summarization, Ophtimus-V2-Tx is designed to support structured diagnostic reasoning using real-world case reports. It generates multi-dimensional clinical outputs-including primary and secondary diagnoses, medications, and surgical procedures-and is evaluated through code-based frameworks, enabling fine-grained and semantically grounded performance assessment.

LLM-based clinical decision support

While LLMs have achieved strong performance in medical question answering and clinical reasoning tasks47,48, they still face limitations in effectively supporting the end-to-end, structured clinical decision-making process in real-world healthcare environments. To overcome this, instruction-tuned models tailored for the medical domain have been proposed. However, most are optimized for benchmark datasets such as MedQA, PubMedQA, and USMLE-style exams rather than real clinical workflows49,50,51.

Recently, various benchmark studies have been conducted to evaluate the architectures, fine-tuning strategies, and clinical applicability of such models, providing a foundation for their integration into real clinical decision support systems24,52,53,54,55. Comprehensive reviews of these clinical LLMs are available in Refs. 56 and 57.

While many instruction-tuned LLMs are optimized for benchmark datasets and question-answering formats, they often fall short of modeling the structured and sequential nature of real clinical decision-making. Our approach integrates instruction-tuning with a structured evaluation methodology based on the CliBench framework, enabling assessment of end-to-end reasoning through standardized coding systems (ICD-10-CM, ATC, ICD-10-PCS) that align with actual clinical workflows.

Practical value of compact LLM

Given the substantial computational and memory requirements of large language models, there has been increasing attention toward more lightweight alternatives known as compact LLMs. These models leverage techniques such as knowledge distillation, pruning, model compression, and quantization to achieve high performance with significantly reduced resource demands. Compact, domain-specialized LLMs are particularly valuable in resource-constrained environments, such as mobile devices, edge computing platforms, and on-premise hospital servers.

Notable examples of compact LLMs include DistilBERT58, TinyBERT59, MobileBERT60, and TinyLlama61, which are actively used in clinical chatbot systems and medical information retrieval assistants. A detailed survey of these small models and optimization techniques can be found in Ref. 62, which systematically reviews recent advances in model compression strategies and evaluates their performance across various clinical NLP tasks.

Compared to existing general-purpose and domain-specific LLMs, our study introduces a novel ophthalmology-focused small language model that bridges performance and practicality. By leveraging real-world clinical narratives, multi-faceted output generation, and semantically grounded evaluation through standardized medical coding systems, Ophtimus-V2-Tx presents a scalable and clinically relevant alternative for intelligent decision support in specialty medicine.

Conclusion

Summary of findings

We present Ophtimus-V2-Tx, a small, domain-specific language model (SLM) for ophthalmology, tuned on schema-structured case reports and evaluated with automated diagnostic reasoning and standardized benchmarking tasks (ICD-10-CM/ATC/ICD-10-PCS; L1–L4 and Full) alongside narrative similarity metrics.

This study is positioned as a benchmarking comparison between a compact, ophthalmology-specific model (Ophtimus-V2-Tx) and large general-purpose LLMs, designed to assess how domain-specific fine-tuning and structured supervision influence diagnostic reasoning and representational alignment in ophthalmic narratives. The purpose of this work is to derive methodological insights into effective development choices for domain-specialized medical LLMs, not to claim clinical deployment readiness. The code-mapping tasks (ICD/ATC/PCS) are employed solely as standardized, quantifiable benchmarks for evaluating model reasoning, rather than as evidence of autonomous code-generation capability.

On our benchmarks, Ophtimus-V2-Tx achieves competitive performance on selected structured tasks, with fine-grained strengths at exact medication/procedure codes and certain hierarchical levels, while absolute gains are modest in several categories and statistical significance is not uniform.

In the context of prior ophthalmology-focused LLMs (Ophtha-LLaMA2, EyeGPT, LEME), which emphasize MCQA or text QA and generally lack code-based evaluation, our case-based tuning and code-aligned assessment help bridge educational QA benchmarks and real-world clinical reasoning within a reproducible benchmarking framework that translates narrative reasoning into standardized, comparable metrics.

All reported metrics reflect alignment with a frozen automated coding policy rather than clinician-adjudicated correctness. Accordingly, the evaluation framework serves as a policy-alignment benchmarking environment to quantify reasoning alignment, not to infer clinical accuracy or deployment feasibility. To enhance reproducibility, the coding agent and prompts were version-locked during evaluation (see Methods). Future work will incorporate structured clinician review of model outputs to anchor automated results to clinically meaningful standards, alongside sensitivity analyses across alternative coding agents and prompting policies to quantify policy dependence, complementing multi-rater adjudication and prospective validation in subsequent iterations.

Future directions

Methodologically, we will expand clinician-adjudicated reference subsets and conduct structured, blinded multi-rater evaluations (with predefined consensus procedures and inter-rater agreement reporting, e.g., \(\kappa\)) to validate automated mapping outputs, characterize near-miss typologies, and ground benchmarking metrics in clinically interpretable standards. In parallel, targeted clinician review of representative model outputs will be introduced to qualitatively assess reasoning fidelity and contextual correctness, providing a bridge between automated evaluation and clinical relevance.

Beyond model evaluation, we will implement a structured program of data-fidelity verification to address current limitations in schema-level validation. Specifically, we will run a statistically powered, stratified sampling audit (stratified by diagnosis/procedure/medication and by document length/source) to quantify token- and field-level fidelity using predefined rubrics: exact-match rates for named tests, laterality agreement (OD/OS/OU rules), and numeric consistency with absolute/relative tolerances and unit checks, along with a structured error taxonomy (omission, substitution/paraphrase, numeric transformation, laterality inversion). We plan to report inter-rater agreement (e.g., \(\kappa\) or Gwet’s AC1) and, where applicable, 95% confidence intervals for the clinician-reviewed subset. In parallel, reinforcement-learning–based preference alignment (RLHF and related human-feedback methods) will be used to automatically detect and correct deviations from verbatim schema preservation. Collectively, these procedures will operate as a continuous data-governance loop, improving fidelity, quality control, and reproducibility across subsequent Ophtimus iterations.

We will also adopt precision-first operating points with selective abstention, site-specific calibration, and post-deployment drift monitoring to ensure safe experimental use in decision-support and research benchmarking contexts, rather than autonomous clinical coding.

Technically, we will explore multimodal inputs (e.g., OCT and fundus imaging) and agentic or ensemble architectures to enhance reasoning robustness and domain specificity, while extending evaluation to additional ophthalmic subspecialties and expanded QA datasets for broader benchmarking.

Continued mapper sensitivity analyses (including cross-mapper checks) will be maintained as code systems and data distributions evolve.