Introduction

Large language models (LLMs) show promise for diagnosis support, treatment planning, and patient communication1,2,3, but moving from general explanation to patient-specific reasoning is hard. Clinical use demands precise interpretation, consistent specialty terminology, and outputs aligned to standardized taxonomies. This is especially acute in ophthalmology, where laterality, quantitative measures, named exams, and longitudinal follow-up are central.

Small, domain-specific language models (SLMs; used interchangeably here with “compact LLMs” to denote resource-efficient smaller models, whereas LLM denotes the general class of large language models) offer a practical path forward: they run on-premise with lower latency and cost while keeping data on site for privacy and compliance; and, with case-based fine-tuning, they more readily internalize fine-grained knowledge (terminology, laterality, code schemas) and support tighter output governance (guardrails, selective abstention, calibration). In practice, SLMs help bridge linguistic fluency to clinically reliable, code-aligned reasoning and improve workflow fit and deployability.

Ophthalmology exemplifies these demands. Clinicians must reason over domain-specific constructs such as laterality, quantitative measurements, named examinations, and longitudinal follow-up, and they must map conclusions to interoperable code systems used for reporting and downstream workflows. General-purpose LLMs, often trained on broad biomedical or encyclopedic text, may be fluent yet insufficiently grounded for these context-sensitive tasks.

To address these gaps, we introduce Ophtimus-V2-Tx, a compact, domain-specialized ophthalmology model adapted from a modern 8B-parameter foundation model4 via parameter-efficient fine-tuning. Rather than relying on loosely structured narratives, we fine-tune on schema-structured case reports that capture end-to-end clinical workflows (presentation, examinations, diagnostic interpretation, therapeutic decisions, and follow-up), preserving details such as laterality and quantitative values. This case-based approach aims to couple clinical specificity with computational efficiency suitable for on-premise settings.

We evaluate model outputs as clinicians would consume them: free-text rationales are mapped to standardized code systems (ICD-10-CM for diagnoses, ATC for medications, ICD-10-PCS for procedures) and scored hierarchically to reflect both exact matches and clinically proximate near-misses. In parallel, we assess the narrative alignment of the free-text itself using complementary text metrics (e.g., ROUGE-L, BLEU, METEOR) and a semantic-similarity measure. This dual perspective highlights cases where clinical reasoning is consistent even when exact codes differ, and provides a more complete picture of output quality beyond strict code identity. For fairness and reproducibility, the same frozen coding agent and prompt are applied to both reference labels and predictions, with deterministic normalization and clearly specified mapping rules.

In comparative studies, Ophtimus-V2-Tx is competitive with leading general-purpose models on clinically grounded outputs and shows advantages in targeted settings, while improving over prior ophthalmology-tuned baselines on narrative measures and structured reasoning. Taken together, these results point to a practical path for compact, domain-adapted models that combine clinical specificity with deployability.

Contributions

  • A case-report–based fine-tuning strategy that strengthens clinical specificity and preserves ophthalmology-relevant structure (e.g., laterality, measurements, named tests).

  • Ophtimus-V2-Tx, a compact ophthalmology model designed for realistic decision-support use and efficient on-premise operation.

  • A hierarchical, code-aligned evaluation spanning diagnoses, medications, and procedures, complemented by free-text similarity assessment to capture narrative alignment.

  • A reproducible and resource-efficient pipeline, including an auditable mapping procedure, intended to lower the barrier to safe, site-ready deployment and future benchmarking5,6.

Results

Comparative evaluation strategy

To rigorously assess the clinical utility of our proposed model, Ophtimus-V2-Tx, we conduct a multi-dimensional comparative evaluation against the following baseline models (Table 1):

Table 1 Evaluated comparative models.

Three of the models in this comparison (Ophtimus-V1-Inst6, Ophtimus-V2-Inst, and Ophtimus-V2-Tx) are our custom instruction-tuned models. Among them, Ophtimus-V2-Inst and Ophtimus-V2-Tx were both developed from a common pretrained foundation, the Ophtimus-V2-Base7 model, with details of its pretraining corpus provided in the “Methods” section. These two models were then fine-tuned for different tasks. The Ophtimus-V2-Inst model was fine-tuned primarily on multiple-choice question answering (MCQA) and essay-style QA datasets, with a focus on clinical reasoning and medical knowledge comprehension. It retained a strong alignment with structured, educational-style questions commonly seen in medical board exams. In contrast, the Ophtimus-V2-Tx model extended the same base foundation through additional fine-tuning that combined the original instruction datasets with real-world clinical case reports, specifically curated to emphasize treatment planning, therapeutic decision-making, and post-treatment follow-up guidance. By incorporating case-based narratives into the fine-tuning data, Ophtimus-V2-Tx was optimized not only for diagnostic reasoning but also for generating contextually appropriate management plans grounded in ophthalmology practice.

The evaluation focuses on four key dimensions: (1) general performance on multiple-choice questions, (2) clinical accuracy based on text similarity, (3) clinical accuracy using the CliBench framework, and (4) keyword-specific MCQA accuracy. First, we conduct a general performance comparison using a medical multiple-choice question answering (MCQA) dataset that includes diagnostic, treatment, and procedural questions. This baseline comparison helps assess the models’ general reasoning capabilities under structured question formats. Second, we perform a text similarity evaluation to assess the linguistic alignment between model-generated outputs and human-written references. This includes the use of ROUGE-L, BLEU, METEOR, and SEMSCORE metrics8,9,10,11, which measure surface-level and semantic correspondence across generated sentences. Third, we adopt the CliBench framework to perform fine-grained, task-specific evaluation across three clinical dimensions: semantic similarity to ground-truth sentences, diagnostic classification accuracy, and treatment decision accuracy. CliBench is a clinically grounded benchmark that maps model outputs to standardized medical code systems (ICD-10-CM, ATC, ICD-10-PCS) and evaluates performance across multiple hierarchy levels (L1–L4 and Full code match). This enables hierarchical F1 score computation for each clinical task.

  • Diagnostic Accuracy: Model-generated primary and secondary diagnoses are automatically mapped to ICD-10-CM codes and evaluated using hierarchical F1 scores.

  • Treatment Accuracy: Medications are mapped to ATC codes and surgical procedures to ICD-10-PCS codes. Accuracy is again measured by hierarchical F1 scores, highlighting the model’s ability to generate valid therapeutic outputs.

This CliBench-based evaluation is the core of our analysis, as it directly measures the clinical validity of model-generated diagnoses and treatment plans using medically structured standards. It complements surface-level metrics by evaluating whether the outputs are clinically appropriate within a structured reasoning framework.

Finally, we analyze MCQA performance at the keyword level. Clinical topics such as macular degeneration, glaucoma, and diabetic retinopathy are used to cluster questions and compute model-wise accuracy within each topic. This allows us to identify domain-specific strengths and weaknesses and highlights the subfields where Ophtimus-V2-Tx provides the most substantial improvement.
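
For illustration, the topic-wise accuracy computation can be sketched as follows; the table layout and column names (topic, model, correct) are hypothetical placeholders rather than the exact analysis script.

```python
# Hypothetical sketch of the topic-wise MCQA accuracy analysis.
# Assumes a results table with one row per (question, model) pair and
# columns: "topic", "model", "correct" (1 if the model's choice matched the key).
import pandas as pd

results = pd.read_csv("mcqa_results.csv")  # hypothetical file name

# Accuracy per (topic, model), plus the number of questions per topic.
topic_acc = (
    results.groupby(["topic", "model"])["correct"]
    .agg(accuracy="mean", n="size")
    .reset_index()
)
print(topic_acc.sort_values(["topic", "accuracy"], ascending=[True, False]))
```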

Qualitative comparison of clinical outputs across models in case-based scenarios

Table 2 presents a comparative evaluation of diagnostic and treatment suggestions generated by GPT-4o, Ophtimus-V2-Inst, and Ophtimus-V2-Tx for four real-world ophthalmic cases. While GPT-4o and Ophtimus-V2-Inst generally provided partially correct or broad responses, Ophtimus-V2-Tx consistently produced more specific and clinically accurate outputs aligned with expert ground truth. These results underscore the effectiveness of case-based fine-tuning in enhancing complex treatment reasoning capabilities.

Table 2 Comparative diagnosis and treatment suggestions across models for four clinical cases.

Sentence similarity analysis

Table 3 presents the performance comparison across four widely used automatic evaluation metrics for language generation: ROUGE-L, BLEU, METEOR, and SEMSCORE. Each metric captures a distinct aspect of text quality, providing a comprehensive view of model performance.

Table 3 Performance comparison across aggregated generation metrics (mean and 95% CI via cheap subsampling bootstrap). Bold values indicate the best-performing scores within each column.

ROUGE-L measures the longest common subsequence between the generated and reference sentences, focusing on structural similarity and content overlap. BLEU evaluates precision over short n-grams, reflecting how accurately the generated output reproduces the lexical content of the reference. METEOR, in contrast, incorporates stemming and synonym matching to account for semantically equivalent expressions even when exact wording differs. Lastly, SEMSCORE leverages pre-trained sentence embedding models (e.g., SBERT) to compute semantic similarity, emphasizing the overall meaning conveyed rather than exact word usage.

Across aggregated generation metrics with 95% confidence intervals (Table 3), Ophtimus-V2-Tx exceeds OpenAI GPT-4o on ROUGE-L (0.40 [0.40, 0.41] vs. 0.18 [0.17, 0.18]), BLEU (0.26 [0.26, 0.27] vs. 0.05 [0.05, 0.05]), and METEOR (0.45 [0.45, 0.46] vs. 0.29 [0.29, 0.29]); the 95% CIs are non-overlapping, indicating robust improvements in structural and lexical alignment under our bootstrap protocol.

For SEMSCORE, OpenAI GPT-4o attains a slightly higher estimate (0.82 [0.81, 0.82]) than Ophtimus-V2-Tx (0.80 [0.80, 0.81]); because the 95% CIs overlap at 0.81, we characterize semantic similarity as comparable rather than conclusively different.

Taken together, these results suggest that domain-specific pretraining and targeted instruction tuning in Ophtimus-V2-Tx yield substantial gains in structural/lexical fidelity while maintaining high semantic adequacy. In line with best practice, we report effect directions alongside uncertainty quantification rather than blanket superiority claims.

Diagnosis analysis

Table 4 summarizes the hierarchical F1 scores for all CliBench tasks—Primary and Secondary Diagnosis, Medication, and Surgical Procedure—along with their 95% confidence intervals. For each task, the best-performing scores at each hierarchical level (L1 to Full) are highlighted in bold.

Primary diagnosis results

Ophtimus-V2-Tx (8B) attains the best L2 score (0.58) and ties for the best L4 score (0.40). At L3 it is essentially on par with OpenAI API (0.51 vs. 0.52), while at Full it reaches 0.23 versus OpenAI’s 0.26. Overall, Tx shows strength at mid/fine levels (L2/L4) in terms of structural understanding and code concordance; most gaps fall within overlapping 95% CIs, underscoring its competitiveness despite the 8B parameter scale.

Secondary diagnosis results

Ophtimus-V2-Tx achieves the best Full (exact code) score (0.15) and ties for the best L4 (0.17) (cf. OpenAI Full 0.12, L4 0.17). OpenAI leads at L1–L3, but Tx is strongest in fine-grained, code-level alignment. Because clinical workflows such as billing, registry curation, and downstream decision support often require exact-code agreement, this pattern is practically meaningful.

Contributions and implications of Ophtimus-V2-Tx (8B)

In sum, the domain-tuned 8B Ophtimus-V2-Tx competes directly with a large proprietary system and shows consistent advantages at fine-grained levels (L4 and Full), metrics that matter in clinical use. We attribute this to (i) a schema-first (case-structured) input that preserves quantitative values, laterality (OD/OS/OU), test names (OCT/FA/OCTA), and explicit findings, and (ii) source-level evaluation in which reference labels from the original texts are normalized to ICD-10-CM for comparison. Moreover, the parameter efficiency (8B) enables on-premise/edge deployment and lower inference cost, supporting immediate workflows such as diagnosis code suggestion/verification, case-based differential diagnosis candidates, and documentation/registry quality control.

In addition to reporting confidence intervals for each model, we also examined confidence intervals for the pairwise differences in accuracy and F1 scores using bootstrap resampling across all evaluation tasks. This provides a conservative statistical test: if the 95% CI of the difference excludes zero, the performance gap can be considered statistically significant.
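
A minimal sketch of this paired-difference bootstrap is shown below; it assumes aligned per-case score arrays for two models and reuses the subsampling settings reported in the Methods (m = 0.6N, B = 100), though the exact implementation may differ.

```python
# Sketch of a paired bootstrap CI for the difference in a per-case metric
# between two models (e.g., per-case F1 or 0/1 accuracy). Assumes `scores_a`
# and `scores_b` are aligned arrays over the same evaluation cases.
import numpy as np

def paired_diff_ci(scores_a, scores_b, m_frac=0.6, n_boot=100, seed=42):
    rng = np.random.default_rng(seed)
    scores_a, scores_b = np.asarray(scores_a), np.asarray(scores_b)
    n = len(scores_a)
    m = int(m_frac * n)  # cheap subsampling: resample m = 0.6N cases
    diffs = []
    for _ in range(n_boot):
        idx = rng.choice(n, size=m, replace=True)
        diffs.append(scores_a[idx].mean() - scores_b[idx].mean())
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return lo, hi  # if the interval excludes 0, the gap is treated as significant
```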

Table 4 Hierarchical F1 scores for CliBench across all tasks—primary/secondary diagnosis, medication, and surgical procedure—with 95% confidence intervals (CI).

Treatment result analysis

CliBench imposes exact ICD-10-CM/ATC/ICD-10-PCS matching from free text, which is stricter than MCQA and can understate clinically reasonable outputs that miss a specific leaf code. To contextualize absolute scores, we additionally report hierarchical proximity (L1–L4 vs. Full) and text-level semantic similarity (e.g., ROUGE-L, BLEU, METEOR/semantic score). These complementary views decouple clinical alignment from exact-code identity and show that many errors are near-misses rather than clinically irrelevant.

On CliBench, treatments were evaluated as Medication (ATC) and Surgical Procedure (ICD-10-PCS) using hierarchical F1 at L1–L4 and Full (exact code), with 95% CIs reported.

Medication

OpenAI leads across levels, while Ophtimus-V2-Tx (8B) narrows the gap at fine granularity: at Full it reaches 0.31 versus OpenAI’s 0.32 with overlapping CIs, indicating practically comparable exact-code concordance despite the parameter gap.

Surgical procedure

The two models tie at L1 (0.75 each), showing similar recognition of broad procedure classes. OpenAI is stronger at L2–L3, but Full converges (0.13 for Tx vs. 0.16 for OpenAI; overlapping CIs), suggesting proximity at the exact-code level.

Implications

These results show that an 8B, schema-first model can deliver near state-of-the-art exact-code performance for medication and competitive surgical coding at the coarse level, supporting first-pass code suggestion/verification and treatment-option drafting with lower compute cost and on-prem/edge deployability.

Contributions and implications of Ophtimus-V2-Tx (8B)

As a compact, ophthalmology-tuned 8B model, Ophtimus-V2-Tx competes with much larger systems, with its strongest gains at deep taxonomy tiers and exact-code agreement. An evidence-anchored case template coupled with code-normalized evaluation yields auditable, easily integrated outputs and strong accuracy-per-compute. This enables on-prem/edge use for first-pass coding, differential candidate lists, and documentation QA. The shared schema/codebooks extend to treatment tasks and other subspecialties, providing a practical route to scalable, privacy-preserving clinical LLMs.

Benchmark-based performance comparison

Evaluation datasets

To evaluate and compare the performance of Ophtimus-V2-Tx against other medical language models, we employed three datasets, each representing different evaluation dimensions: domain-specific expertise, professional-level QA robustness, and literature-based inference (Table 5).

  • Ophtimus-Eval-V15: This dataset was independently constructed to benchmark ophthalmology-specific MCQA performance. All questions were curated from public educational resources and reviewed by medical experts to ensure quality. Notably, it allows topic-wise performance analysis across 19 ophthalmic subfields. Access is currently restricted to protect clinical validation integrity.

  • MedMCQA (Ophthalmology Subset)12: Extracted from a national medical exam dataset (NEET-PG), this subset contains thousands of ophthalmology-focused questions. It serves to test the transferability of general medical QA models to the ophthalmology domain and includes expert-reviewed explanations for some questions.

  • PubMedQA (Ophthalmology Subset)13: This subset targets biomedical inference by linking questions to PubMed abstracts with Yes/No/Maybe labels. Although limited in size, it evaluates a model’s ability to reason over real-world biomedical literature in ophthalmology.

Table 5 Comparison of evaluation datasets used for ophthalmology-related model benchmarking.

General performance analysis

Table 6 shows the overall classification accuracy of each model across three multiple-choice question answering (MCQA) datasets: Ophtimus-Eval-V1 (ophthalmology-specific), MedMCQA (medical licensing-style questions), and PubMedQA (biomedical research questions).

Table 6 Overall classification accuracy (proportion) with 95% CIs (cheap subsampling bootstrap; \(m=0.6N\), \(B=100\)) on the Ophtimus-Eval-V1, MedMCQA, and PubMedQA datasets.

Compared to other baselines, Ophtimus-V2-Inst achieved strong performance on Ophtimus-Eval-V1 (0.63) and competitive results on MedMCQA (0.71) and PubMedQA (0.72), showing a balanced understanding across domain-specific, clinical, and general biomedical contexts. Interestingly, Ophtimus-V2-Tx exhibited a distinctive pattern: it showed lower accuracy on Ophtimus-Eval-V1 (0.43) and MedMCQA (0.53), likely due to its focus on generative treatment tasks rather than QA fine-tuning; this trade-off is discussed further in a later section.

On PubMedQA, Ophtimus-V2-Tx achieved an accuracy of 0.88, which is competitive with OpenAI GPT-4o (0.91) and higher than all other compared models. Given the relatively small sample size of the ophthalmology subset of PubMedQA (\(N=297\)), these results should be interpreted with caution and warrant further validation; to reflect this uncertainty, we also report 95% confidence intervals and confusion matrices in Fig. 7. In contrast, many general-purpose LLMs such as LLaMA 3-8B Instruct and PMC-LLaMA-13B showed inconsistent or modest performance across the three benchmarks, whereas OpenAI GPT-4o demonstrated the highest overall robustness, followed closely by Ophtimus-V2-Tx on PubMedQA. Overall, these results illustrate the trade-offs between task specialization and QA generalization. The two instruction-tuned variants of Ophtimus-V2 exhibit complementary strengths: Ophtimus-V2-Inst excels in domain-specific and clinically structured MCQA, while Ophtimus-V2-Tx shows remarkable generalization to biomedical inference tasks despite limited MCQA fine-tuning.

Figure 1 shows the token distribution used in fine-tuning Ophtimus-V2-Tx, revealing that 29.1% of its training data was derived from ophthalmology case reports. In contrast, Ophtimus-V2-Inst was trained exclusively on QA-formatted data from PubMed and textbook sources. Despite sharing the same base model, fine-tuning objective, and loss function, Ophtimus-V2-Tx showed a notable drop in MCQA performance (e.g., 0.43 on Ophtimus-Eval-V1), likely due to a reduced proportion of QA-specific signals. The inclusion of narrative-style clinical cases may have diluted the model’s QA alignment, underscoring the trade-off between generalization to treatment reasoning and specialization in structured QA tasks.

Fig. 1 Token distributions over training datasets.

Topic-wise performance analysis

Figure 2 presents a comparative analysis of multiple-choice question answering (MCQA) accuracy across 20 ophthalmic subtopics. The number of questions per topic is indicated in parentheses, providing context for interpreting model performance. For this analysis, we used only the Ophtimus-Eval dataset, since it is the only benchmark among the three (Ophtimus-Eval, MedMCQA, PubMedQA) that provides explicit categorization by ophthalmic keywords (e.g., Uveitis, Glaucoma, AMD). This allowed us to assess performance variation across clinical subdomains. PubMedQA and MedMCQA lack topic-level annotations, and thus were not suitable for topic-wise comparison.

Fig. 2 Topic-wise performance comparison on ophthalmology QA (number of test samples per topic in parentheses).

Overall, OpenAI GPT-4o demonstrated consistently strong performance across most topics, showing particularly high accuracy in general areas such as General Ophthalmology (986), Pharmacology (40), and Conjunctiva (21). However, it did not always achieve the top score, as certain specialized ophthalmic topics were better handled by the Ophtimus series models. Ophtimus-V2-Inst, trained on MCQA and essay-style reasoning questions, achieved competitive or even superior accuracy in domains such as Neuro-Ophthalmology (165), Ocular Trauma (28), and Systemic Diseases (8). Especially in Neuro-Ophthalmology, which requires reasoning across neurology and ophthalmology, the model showed strong task-specific inference capabilities. Interestingly, although Ophtimus-V2-Tx was fine-tuned for treatment planning and follow-up generation, it exhibited high accuracy in clinically reasoning-intensive domains such as Uveitis (40), Retina and Vitreous (13), and Oculoplastic (11).

We note, however, that this topic-wise result is presented as an exploratory observation rather than a definitive benchmark of complexity. Given the small sample size (N = 40) and the wide confidence interval, these findings should be interpreted with caution, serving only to illustrate potential trade-offs introduced by case-report fine-tuning.

Uveitis is a complex inflammatory condition with diverse etiologies that demands high-level clinical reasoning and differential diagnosis skills14. This suggests that the case-based learning approach used to train the Tx model effectively enhanced its domain understanding and therapeutic decision-making capabilities. In summary, GPT-4o achieved the highest overall average performance, but Ophtimus-V2-Inst proved to be a competitive domain-specialized model for medical QA, and Ophtimus-V2-Tx demonstrated strengths in clinically complex topics that require fine-grained therapeutic reasoning. These results highlight the complementary roles of general-purpose and task-specific medical language models and offer valuable guidance for model selection and design in clinical applications (Table 7).

Table 7 Topic-wise accuracy comparison with training sample distribution. Bold values indicate the best-performing accuracy within each topic between the two models. 

Methods

Ophthalmic foundation model: Ophtimus-V2-base

Figure 3 illustrates the overall development pipeline for our ophthalmology-specialized language model. The process consists of four main stages: Source Data Gathering, Text Pre-processing and Pre-training, Instruction-based Fine-Tuning, and Evaluation.

Fig. 3 Training and evaluation workflow for Ophtimus-V2 language models.

In the source data collection stage, we restricted retrieval to the PubMed Central (PMC) Open Access subset and the English language, applying article-type filters to include case reports and closely related clinical report categories. Using ophthalmology-specific Boolean/MeSH queries—25 for the Base pretraining corpus and 29 for the case-report corpus (exact query lists, field tags, and access dates in Table 9 and the Supplement)—we retrieved candidate records. We then performed document-level deduplication using PMCID, removed near-duplicates, and conservatively removed any personally identifiable information (PII). Aggregating across sources yielded approximately 18.4M tokens from PMC and about 4.0M tokens from domain-specific textbooks (e.g.,15,16,17,18,19,20,21). Across the textbook subset used for pretraining, we count 18 volumes and 7962 pages in total, corresponding to an average of 442.3 pages per volume. Together these sources form the final Ophthalmology Corpus dataset.
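
The retrieval and deduplication step can be sketched as follows with Biopython’s Entrez utilities; the query shown is a simplified placeholder for the actual Boolean/MeSH queries listed in Table 9 and the Supplement, and the email address is a placeholder.

```python
# Illustrative sketch of PMC Open Access retrieval and PMCID-level deduplication.
# The query below is a simplified stand-in, not one of the actual 25/29 queries.
from Bio import Entrez

Entrez.email = "researcher@example.org"  # required by NCBI; placeholder address

query = (
    '"ophthalmology"[MeSH Terms] AND "case reports"[Publication Type] '
    'AND "open access"[Filter] AND english[Language]'
)
handle = Entrez.esearch(db="pmc", term=query, retmax=10000)
record = Entrez.read(handle)  # paged retrieval / history server needed for full result sets
handle.close()

pmc_ids = set(record["IdList"])  # document-level deduplication by PMCID
print(f"{len(pmc_ids)} unique PMC records retrieved")
```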

This curated corpus was then used to pre-train a foundation model based on LLaMA 3.1, resulting in the creation of the Ophtimus-V2-Base model. This base model is optimized for general understanding and generation of ophthalmic text, such as clinical descriptions, diagnostic reasoning, and treatment explanations.

In the next stage, we performed instruction tuning to train the model on medical question-answering tasks. We constructed a domain-specific QA dataset using the OpenAI API, focusing on both multiple-choice and essay-style formats that reflect real-world clinical reasoning. Through supervised fine-tuning with LoRA techniques22, we derived the Ophtimus-V2-Inst model, which is tailored to handle complex medical queries and reasoning tasks effectively and efficiently.

Finally, we conducted a comprehensive evaluation using both external benchmarks and our own Ophtimus-Eval-V1 dataset, a carefully curated multiple-choice QA (MCQA) evaluation set based on ophthalmic case scenarios collected from Academia.edu. Ophtimus-Eval-V1 was designed to assess overall accuracy and keyword-specific performance. This allowed us to identify domain-specific strengths, such as performance variations across key ophthalmic topics like macular degeneration, glaucoma, and diabetic retinopathy.

Data Sources and Usage Policy

Domain textbooks (e.g., BCSC, Kanski) used in this work are copyrighted. We employed them exclusively for non-consumptive, research-only text mining to derive structured fields and aggregate statistics. The pretraining textbook subset comprises 18 volumes totaling 7962 pages (mean 442.3 pages per volume). No copyrighted text or images are redistributed; the released artifacts consist only of model weights, code, prompts, schema definitions, and aggregate metrics. For reproducibility, we provide an open-access–only retraining recipe and report ablations demonstrating similar performance trends without copyrighted sources.

Ophthalmic case report dataset construction

Figure 4 illustrates the overall process used to construct the ophthalmology case report dataset for fine-tuning the Ophtimus-V2-Tx model. The process consists of three main stages: Data Crawling, Data Filtering, and Formatting Data for Fine-tuning.

Fig. 4 Dataset construction for Ophtimus-V2-Tx.

  1. Data crawling

    We first collected documents from two major sources: PubMed (PMC Open Access only, English, no date limit) and an ophthalmology case-report textbook. Using ophthalmology-specific keywords/MeSH (25 for Base; 29 for Case Report), we retrieved 10,377 case-report candidates from PubMed (initial retrieval), followed by screening and deduplication. After PMCID-based document-level deduplication and eligibility screening, 8261 articles remained; from these, we structured 10,815 individual patient cases. In parallel, we extracted and grouped case narratives from Clinical Cases in Medical Retina (1st ed.; 74 chapters). This source is copyrighted; materials were used strictly under the publisher’s terms for research-only text mining, and no original text or derivatives were redistributed. To prevent data leakage, textbook-derived content was used only for pretraining/fine-tuning and was excluded from evaluation.

  2. Data filtering

    We applied multi-stage filtering with the OpenAI API to both PubMed papers and textbook-based documents.

    Before invoking the API, we applied rule-based gates (document type/language/length/keywords) with deduplication and near-duplicate removal. We then used a frozen OpenAI API strictly as a text classifier (no generation), with prompts limited to document-type decisions-we did not solicit subjective “paper relevance.” Prompt templates are provided verbatim in Appendix B.2.

    In the first step, the classifier operated on PubMed papers only, checking ophthalmic domain membership, language, and the presence of experimental/control groups.

    In the second step, both PubMed papers and textbook documents underwent privacy/irrelevance screening: residual identifiers (names/initials, institution names, contact lines, exact visit dates, addresses, image identifiers) were removed or normalized, while clinical signals (quantitative values, laterality tokens, test names, explicit findings) were preserved; the PHI policy and examples appear in Appendix B.1 and B.2.

    Through this two-phase filtering, we selected a final set of 8261 PubMed documents and 74 textbook chapters as usable ophthalmology case-report documents for fine-tuning.

  3. Formatting data for fine-tuning

    The filtered documents were converted to a schema-first, programmatic format-not free-form summarization. A rule-based mapper populated a fixed template with two blocks (Patient Information; Diagnosis and Treatment). The OpenAI API was used only as a frozen fallback field router to disambiguate spans (e.g., findings vs. rationale), never to paraphrase or condense; ambiguous cases were flagged for spot-check. The routing prompt is given in Appendix B.3, and the schema in Table 8.

Table 8 Structured format for case reports.
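
For illustration, a minimal version of this two-block template might look as follows; the field names are hypothetical placeholders and do not reproduce the exact schema of Table 8.

```python
# Hypothetical schema sketch for a structured case record with the two blocks
# described above (Patient Information; Diagnosis and Treatment). Field names
# are illustrative placeholders, not the exact Table 8 schema.
CASE_TEMPLATE = {
    "patient_information": {
        "age": None,
        "sex": None,
        "laterality": None,          # OD / OS / OU, preserved verbatim
        "presenting_symptoms": [],
        "examinations": [],          # named tests (e.g., OCT, FA, OCTA) with values
        "findings": [],              # explicit findings with quantitative values
    },
    "diagnosis_and_treatment": {
        "primary_diagnosis": None,
        "secondary_diagnoses": [],
        "medications": [],
        "surgical_procedures": [],
        "follow_up": None,
    },
}
```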

We utilized a total of 10,815 cases (one document may contain multiple patient cases) for training and evaluation. Among these, 9193 cases were assigned to the training set and 1622 cases to the evaluation set through random splitting. The 76 cases collected from the textbook were included only in the training data. While a single case-report document may contain multiple patient cases, meaning that cases from the same document could be assigned separately to the training and evaluation sets, this does not pose a problem, as each case represents an independent patient.

Table 9 Retrieval statistics of case report-relevant ophthalmology papers by keyword (N = 8261).

Fine-tuning strategy

Figure 5 illustrates our fine-tuning strategy for the ophthalmology-specific adaptation of the Ophtimus-V2-Base model. To develop Ophtimus-V2-Tx, a domain-specialized language model tailored for ophthalmology, we began with the pretrained foundation model, Ophtimus-V2-Base (8B parameters), and further refined it using a combination of QA-based datasets and real-world clinical case reports.

Fig. 5 Fine-tuning for Ophtimus-V2-Tx.

To enhance the model’s clinical reasoning capabilities in real-world ophthalmic contexts, we incorporated 9193 structured case reports covering a wide range of ophthalmologic conditions, carefully formatted to capture end-to-end diagnostic workflows (patient presentation, symptom progression, relevant examinations, diagnostic interpretation, treatment decisions, and follow-up considerations). From the final pool of 10,891 structured cases, the 9193 PubMed-derived training cases and the 76 textbook-derived cases were used for fine-tuning, with the remaining 1622 cases held out for evaluation.

For efficient domain adaptation, we employed Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method. LoRA introduces lightweight trainable matrices into each transformer layer, enabling specialization to new domains without modifying the entire set of pretrained parameters. This strategy offers computational efficiency while retaining the base model’s general medical knowledge. The integration of general-domain QA data with LoRA-based fine-tuning on case report data resulted in the final model, Ophtimus-V2-Tx. Trained on a total of 95,157 QA and case examples, this model combines broad medical understanding with precise clinical reasoning abilities for ophthalmology-specific tasks.

Training setup

We fine-tuned Ophtimus-V2-Tx from Ophtimus-V2-Base using parameter-efficient Low-Rank Adaptation (LoRA) without applying quantization. The configuration was as follows:

  • LoRA: \(r=32\), \(\alpha =16\), dropout \(=0.05\); target_modules = {q_proj, v_proj, k_proj, o_proj, gate_proj, up_proj, down_proj}.

  • Optimizer and LR schedule: AdamW optimizer; learning rate = \(2{\times }10^{-4}\); weight decay = 0.01; warmup steps = 10; constant-with-warmup scheduler.

  • Batching: per-device train batch size = 8; gradient accumulation = 8; effective batch size across 3 GPUs = 192.

  • Training duration and hardware: 25 epochs on 3×NVIDIA A100 80GB GPUs; total training time of approximately 3 days.

  • Sequence length: maximum context length = 4096 tokens.

  • Evaluation/save: eval strategy = steps (every 25 steps), save strategy = epoch, logging steps = 1, seed = 42.
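
A minimal configuration sketch using the Hugging Face peft and transformers libraries is given below; it mirrors the hyperparameters listed above but is not the exact training script, and argument names may differ slightly across library versions. The model identifier is a placeholder.

```python
# Sketch of the LoRA fine-tuning configuration described above (not the exact script).
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

base = AutoModelForCausalLM.from_pretrained("Ophtimus-V2-Base")  # placeholder model id

lora_cfg = LoraConfig(
    r=32, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(base, lora_cfg)

args = TrainingArguments(
    output_dir="ophtimus-v2-tx",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=8,      # effective batch = 8 * 8 * 3 GPUs = 192
    learning_rate=2e-4,
    weight_decay=0.01,
    warmup_steps=10,
    lr_scheduler_type="constant_with_warmup",
    num_train_epochs=25,
    eval_strategy="steps",              # "evaluation_strategy" in older transformers
    eval_steps=25,
    save_strategy="epoch",
    logging_steps=1,
    seed=42,
)
```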

Inference settings

Unless otherwise noted, decoding used a maximum context of 4096 tokens and generation length up to 512 tokens, with temperature 0.7 and top-p 0.9. Greedy decoding and beam search were not used. We fix these parameters across evaluations to ensure comparability.
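
For reference, these settings correspond roughly to the following generation call; this is a sketch with placeholder model identifiers, not the exact inference code.

```python
# Sketch of the decoding settings: nucleus sampling (temperature 0.7, top-p 0.9),
# up to 512 new tokens, no beam search. Model/tokenizer ids are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Ophtimus-V2-Tx")        # placeholder id
model = AutoModelForCausalLM.from_pretrained("Ophtimus-V2-Tx")

prompt = "Structured case context goes here ..."
inputs = tok(prompt, return_tensors="pt", truncation=True, max_length=4096)
outputs = model.generate(
    **inputs, max_new_tokens=512, do_sample=True,
    temperature=0.7, top_p=0.9, num_beams=1,
)
print(tok.decode(outputs[0], skip_special_tokens=True))
```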

Scoring criteria

To quantitatively assess the quality of diagnosis and treatment sentence generation by Ophtimus-V2-Tx and baseline models, we employed four representative natural language generation (NLG) evaluation metrics: ROUGE-L, BLEU, METEOR, and SEMSCORE. These metrics jointly evaluate not only lexical overlap but also sentence structure, syntactic coherence, and semantic similarity-an essential consideration in clinical settings where the accurate and consistent conveyance of information directly influences patient care.

Confidence intervals. For all key metrics (accuracy, F1, precision, recall), we report 95% confidence intervals estimated via a cheap subsampling bootstrap (m = 0.6N, B = 100 resamples)23.
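
A minimal sketch of this subsampling bootstrap, assuming an array of per-case metric values, is given below; the exact implementation may differ.

```python
# Sketch of the cheap subsampling bootstrap (m = 0.6N, B = 100) used for the
# 95% CIs reported with each metric. `per_case_scores` holds per-case values
# (e.g., per-sentence ROUGE-L, or 0/1 correctness for accuracy).
import numpy as np

def subsample_ci(per_case_scores, m_frac=0.6, n_boot=100, seed=42):
    rng = np.random.default_rng(seed)
    x = np.asarray(per_case_scores)
    m = int(m_frac * len(x))
    means = [x[rng.choice(len(x), size=m, replace=True)].mean() for _ in range(n_boot)]
    return x.mean(), np.percentile(means, 2.5), np.percentile(means, 97.5)
```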

  • ROUGE-L (Longest Common Subsequence). ROUGE-L evaluates structural similarity between the generated sentence C and the reference sentence R based on the length of their longest common subsequence (LCS). It reflects both the precision (how much of C matches R) and the recall (how much of R is preserved in C), using an F-score formulation:

    $$\text {ROUGE-L} = \frac{(1 + \beta ^2) \cdot \text {LCS}\_\text {precision} \cdot \text {LCS}\_\text {recall}}{\text {LCS}\_\text {recall} + \beta ^2 \cdot \text {LCS}\_\text {precision}}$$

    where:

    $$\text {LCS}\_\text {precision} = \frac{\text {LCS}(C, R)}{|C|}, \quad \text {LCS}\_\text {recall} = \frac{\text {LCS}(C, R)}{|R|}$$

    \(\beta\) is typically set to 1.2 to favor recall.

  • BLEU (Bilingual Evaluation Understudy). BLEU measures the precision of n-gram matches between the candidate and the reference sentences, with a brevity penalty to discourage overly short outputs. The overall score is computed as:

    $$\text {BLEU} = \text {BP} \cdot \exp \left( \sum _{n=1}^N w_n \cdot \log p_n \right)$$

    where \(p_n\) is the modified n-gram precision, \(w_n\) is the weight for each n-gram (commonly \(w_n = \frac{1}{N}\)), and BP is the brevity penalty:

    $$\text {BP} = {\left\{ \begin{array}{ll} 1, & \text {if } c > r \\ \exp \left( 1 - \frac{r}{c}\right) , & \text {if } c \le r \end{array}\right. }$$

    with c and r representing the lengths of the candidate and reference sentences, respectively.

  • METEOR (Metric for Evaluation of Translation with Explicit ORdering). METEOR addresses BLEU’s limitations by incorporating both unigram precision and recall, while also accounting for stemming, synonym matching, and word order alignment. The metric computes a harmonic mean (\(F_{mean}\)) and applies a penalty based on fragmentation:

    $$\text {METEOR} = F_{mean} \cdot (1 - Penalty)$$

    where:

    $$F_{mean} = \frac{10 \cdot P \cdot R}{R + 9P}, \quad Penalty = \gamma \cdot \left( \frac{ch}{m} \right) ^\theta$$

    Here, P and R are unigram precision and recall, ch is the number of matched chunks, m is the number of matches, and \(\gamma\), \(\theta\) are penalty parameters (e.g., \(\gamma = 0.5\), \(\theta = 3\)).

  • SEMSCORE (Sentence Embedding Similarity Score). SEMSCORE evaluates semantic similarity by computing the cosine similarity between dense sentence embeddings, such as those produced by Sentence-BERT:

    $$\text {SEMSCORE}(C, R) = \cos (\textbf{e}_C, \textbf{e}_R) = \frac{\textbf{e}_C \cdot \textbf{e}_R}{\Vert \textbf{e}_C\Vert \cdot \Vert \textbf{e}_R\Vert }$$

    where \(\textbf{e}_C\) and \(\textbf{e}_R\) denote the embedding vectors of the candidate and reference sentences. This allows the metric to capture meaning equivalence even in the absence of direct lexical overlap.
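
These metrics can be computed with standard open-source implementations; the sketch below uses rouge-score, NLTK, and Sentence-Transformers, with the embedding model named only as a plausible choice rather than the exact configuration used here.

```python
# Sketch of metric computation with common open-source libraries (not necessarily
# the exact packages/versions used in this work).
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score   # requires nltk wordnet data
from sentence_transformers import SentenceTransformer, util

reference = "Intravitreal anti-VEGF injection was administered in the right eye."
candidate = "The right eye was treated with an intravitreal anti-VEGF injection."

rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(reference, candidate)["rougeL"].fmeasure
bleu = sentence_bleu([reference.split()], candidate.split(),
                     smoothing_function=SmoothingFunction().method1)
meteor = meteor_score([reference.split()], candidate.split())

embedder = SentenceTransformer("all-mpnet-base-v2")  # assumed embedding model
sem = util.cos_sim(embedder.encode(candidate), embedder.encode(reference)).item()
print(rouge_l, bleu, meteor, sem)
```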

Evaluation framework: CliBench

In this study, we adopted a structured evaluation framework based on the CliBench benchmark to quantitatively assess the clinical quality of diagnostic and treatment sentences generated by large language models24. CliBench goes beyond simple question–answering tasks by replicating the complex decision-making processes required in real-world clinical settings. It encompasses a diverse range of medical tasks, such as diagnosis, procedures, laboratory testing, and medication prescriptions, each of which is mapped to standardized medical code systems (ICD-10-CM, ICD-10-PCS, LOINC, ATC). This enables model-generated outputs to be evaluated in terms of their clinical fidelity at the level of medical coding granularity.

A core feature of CliBench is its use of hierarchical taxonomies within medical code systems. Diagnostic outputs are evaluated across progressively finer-grained levels of specificity, from broad disease categories (L1), to subgroups (L2–L4), and finally to full-code matches. Similarly, medication and surgical procedure outputs are scored based on the structured hierarchies of the ATC and ICD-10-PCS systems, respectively. This multi-level evaluation scheme enables fine-grained assessment of both the model’s clinical reasoning capabilities and its ability to generate terminology that aligns with standardized classification systems.

Table 10 presents an example of hierarchical medical coding applied to a clinical case involving diagnosis, medication, and surgical procedure. Each entry is mapped to standardized medical codes—ICD-10-CM for diagnosis, ATC for medication, and ICD-10-PCS for surgical procedure. The table also illustrates the hierarchical structure of each code across four levels (L1–L4), enabling fine-grained clinical evaluation.

Table 10 Example of hierarchical medical coding for diagnosis, medication, and surgical procedure.

As an illustrative example, the diagnosis Ophthalmomyiasis externa—a parasitic infestation of the external eye by fly larvae—is mapped to the ICD-10-CM code B87.81. This code belongs to a hierarchical classification system that allows for multi-level evaluation, as shown below:

  • L1-B: Infectious diseases

  • L2-87: Myiasis (infestation by fly larvae)

  • L3-8: Other myiasis

  • L4-1: External ocular involvement

This layered structure enables hierarchical evaluation of model-generated diagnoses. Even when the model fails to produce an exact full-code match, partial matches at higher levels (e.g., L1 or L2) still indicate meaningful clinical relevance. Such stratified evaluation is crucial in measuring not only exact correctness but also partial reasoning alignment.
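
For illustration, the hierarchical levels of an ICD-10-CM code can be approximated by prefix truncation after deterministic normalization, as in the sketch below (cumulative prefixes of the segments listed above; the actual CliBench taxonomy handling, especially for ATC and ICD-10-PCS, may differ).

```python
# Simplified sketch: deriving hierarchical levels of an ICD-10-CM code by
# prefix truncation after deterministic normalization (upper-case, dot removal).
# Real taxonomies (especially ATC and ICD-10-PCS) use their own level widths.
def icd10cm_levels(code: str) -> dict:
    c = code.upper().replace(".", "").replace(" ", "")
    return {
        "L1": c[:1],    # e.g., "B"      (chapter-level letter)
        "L2": c[:3],    # e.g., "B87"    (category)
        "L3": c[:4],    # e.g., "B878"
        "L4": c[:5],    # e.g., "B8781"
        "Full": c,
    }

print(icd10cm_levels("B87.81"))
```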

To apply the CliBench framework to the ophthalmology domain, we constructed an evaluation dataset based on 1622 ophthalmic case reports that were not used during model training. For each case, the Ophtimus-V2-Tx model was prompted to generate four types of responses: primary diagnosis, secondary diagnosis, medication, and surgical procedure. The generated sentences were automatically mapped to corresponding standardized medical codes using the Perplexity coding agent: ICD-10-CM for diagnoses, ATC for medications, and ICD-10-PCS for surgical procedures.

Fig. 6 CliBench-based Ophtimus-V2-Tx evaluation.

Ground-truth and evaluation (code-to-code)

From the curated set (10,891 cases = 10,815 PubMed + 76 textbook), we fine-tune on 9269 cases (9193 PubMed + 76 textbook) and hold out 1622 PubMed cases for evaluation. We extract atomic Diagnosis/Treatment sentences from the held-out cases with PHI masked and obtain reference labels by mapping them via a frozen coding agent to ICD-10-CM, ATC, and ICD-10-PCS. No clinician adjudication or generative rewriting is used; ambiguous mappings are spot-checked. For assessment, the same structured context is given to each LLM; model outputs are recoded by the same agent with an identical prompt to produce predicted codes. We compute precision, recall, and F1 at the code level with hierarchical scoring (L1–L4, Full), optimal matching for multi-code sets, and penalties for omissions/overcoding. This code-to-code design quantifies clinical fidelity, improves fairness with a single coder/prompt, and preserves privacy via PHI masking and standardized mapping (see Fig. 6).
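
A simplified sketch of this code-to-code scoring is given below; it assumes the level-truncation helper sketched above and uses plain set overlap, whereas the actual protocol additionally applies optimal matching for multi-code sets and penalties for omissions and overcoding.

```python
# Simplified hierarchical precision/recall/F1 over sets of codes at one level.
# Uses the icd10cm_levels() helper sketched earlier; plain set overlap only.
def hierarchical_prf(pred_codes, ref_codes, level="L2"):
    pred = {icd10cm_levels(c)[level] for c in pred_codes}
    ref = {icd10cm_levels(c)[level] for c in ref_codes}
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Illustrative near-miss: the prediction matches at the category level (L2)
# but not at the exact leaf code (Full).
print(hierarchical_prf(["B87.2"], ["B87.81"], level="L2"))    # (1.0, 1.0, 1.0)
print(hierarchical_prf(["B87.2"], ["B87.81"], level="Full"))  # (0.0, 0.0, 0.0)
```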

Code Mapper analysis

Using code datasets (ATC: \(n=1000\); ICD-10-CM: \(n=198\); ICD-10-PCS: \(n=802\)) from Refs. 25 and 26, we compare four mappers (OpenAI, Claude, Gemini, and Perplexity) across the three systems. We report a composite Bias score (lower is better) built from abstention rate, error distance, close-miss rate, and system-specific under-specificity; accuracy is deliberately excluded (Table 18). The meta ranking is OpenAI (0.372) < Gemini (0.446) < Claude (0.479) < Perplexity (0.615) (Table 19). By system: for ATC, Perplexity attains the lowest bias but shows the largest error distance when wrong, while OpenAI exhibits higher abstention and close-miss rates; for ICD-10-CM, OpenAI has the lowest bias (very low abstention, high specificity), while Perplexity is penalized by high abstention; for ICD-10-PCS, Gemini achieves the lowest bias, Claude abstains least but has a higher close-miss rate, and Perplexity ranks worst due to abstention despite relatively low error distance. We did not use GPT-4o (or any LLM) to map or adjudicate predictions; each mapper’s raw code output was evaluated via deterministic normalization (upper-casing, dot/space removal), with ‘X’ treated as abstention. Prompts and post-processing were identical across mappers.

Because our operational goal is fine-grained correctness, L4-level accuracy favors Perplexity across systems (Table 20). We therefore adopt Perplexity as the default mapper and plan rule-based guardrails (triggers: abstention “X”, ATC “Other” L3/L4 = ‘X’, CM .9 unspecified, PCS Device/Qualifier = ‘Z’) with cross-checks using Gemini/Claude/OpenAI when triggered. See Appendix C.1 for details.
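
The deterministic normalization and the planned guardrail triggers can be sketched as follows; the trigger patterns are simplified approximations of the rules described above, and the exact production rules may differ.

```python
# Sketch of deterministic code normalization and rule-based guardrail triggers
# (abstention 'X', ATC "Other" L3/L4 = 'X', ICD-10-CM ".9" unspecified codes,
# ICD-10-PCS Device/Qualifier = 'Z'). Simplified approximations only.
def normalize_code(raw: str) -> str:
    # upper-case, remove dots/spaces; a bare "X" is treated as abstention
    return raw.upper().replace(".", "").replace(" ", "")

def needs_cross_check(system: str, raw: str) -> bool:
    code = normalize_code(raw)
    if code == "X":                                                # abstention
        return True
    if system == "ATC" and len(code) >= 5 and "X" in code[3:5]:
        return True                                                # "Other"-style L3/L4
    if system == "ICD-10-CM" and code.endswith("9"):
        return True                                                # ".9" unspecified (approx.)
    if system == "ICD-10-PCS" and len(code) == 7 and "Z" in (code[5], code[6]):
        return True                                                # Device/Qualifier = 'Z'
    return False
```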

Inter-mapper agreement (Fleiss’/Cohen’s \(\kappa\)) was computed as an exploratory diagnostic and not used for scoring, ranking, or model selection (see Appendix C.1): Inter-mapper agreement is near-perfect for ATC at L2 (Fleiss’ \(\kappa =0.912\)) and moderate for ICD-10-CM/PCS (\(\kappa \approx 0.55\)) despite large label spaces, indicating consensus beyond chance. Pairwise \(\kappa\) shows the strongest concordance along the Claude–Perplexity axis (CM/PCS \(\kappa \approx 0.74\)), with Gemini–Perplexity and Claude–Gemini also strong; OpenAI agrees less in CM/PCS. See Appendix C.1 for full pairwise/Fleiss’ \(\kappa\) tables, methods, and caveats.

Discussion

Impact of case-based fine-tuning

As evidenced by the performance of Ophtimus-V2-Tx in Tables 3 and 4, case-based fine-tuning improves the clinical reasoning capacity of domain-specific language models on selected tasks under our evaluated setting by aligning outputs with real-world medical workflows. This approach outperformed GPT-4o and instruction-tuned baselines on specific structured tasks (e.g., surgical procedure generation and full-code diagnosis classification), with modest absolute gains in several categories. Performance was also comparable on the medication task. Improvements in sentence similarity and higher full-code accuracy on selected tasks, despite the comparatively small model size, suggest that fine-tuning on authentic clinical case reports can capture nuanced decision-making often missed by QA-only training. However, specialization introduces trade-offs: Ophtimus-V2-Tx showed reduced performance on MCQA-style benchmarks, indicating a potential dilution of structured QA capabilities, a limitation we attribute to domain specialization and model capacity rather than universal superiority, as discussed below.

Previous ophthalmology-specific LLMs (Ophtha-LLaMA2, EyeGPT, and LEME) show early progress but remain limited in scope. Ophtha-LLaMA2 focuses on MCQA-style benchmarks and demonstrates strong exam performance yet does not address real-world case-based reasoning. EyeGPT extends GPT architectures to ophthalmology QA tasks with broad coverage but lacks validation for treatment planning or structured code-based evaluation. LEME, an open-source model, provides domain alignment but omits standardized diagnostic/therapeutic coding. In contrast, Ophtimus-V2-Tx is, to our knowledge, the first ophthalmology-specific model fine-tuned on over 10,000 structured case reports and evaluated with the CliBench framework, helping bridge educational QA-style benchmarks and real-world clinical reasoning with code-aligned outputs.

Strengths and limitations of compact LLMs

Ophtimus-V2-Tx, built on a LLaMA-3.1 8B base model, achieves competitive results on our benchmarks when carefully fine-tuned for the target domain. This has implications for computational efficiency and deployment in resource-constrained, privacy-critical settings (e.g., hospitals). Despite its smaller size, Ophtimus-V2-Tx matched, and in some cases exceeded, GPT-4o on selected structured tasks (medication/procedure full-code predictions); differences should be interpreted cautiously where absolute gains are modest. LoRA-based tuning enabled efficient adaptation without full model retraining.

That said, compact design imposes limitations in general-purpose reasoning, transfer, and language coverage. Compared with GPT-4o, Ophtimus-V2-Tx underperformed on general QA datasets such as MedMCQA and exhibited topic-specific variance across ophthalmic subfields (cf. Fig. 2). These patterns likely reflect distributional shifts in training tokens (Fig. 1) and limited capacity of smaller models. Overall, compact models are well-suited for targeted applications, but dataset design and training strategy should balance domain specialization with broader reasoning. A practical route is ensemble- or agent-based designs wherein several compact, sub-domain-specialized models collaborate to capture both specificity and generalization.

Data validation limitations

Although our preprocessing pipeline employed a fixed schema to preserve clinical details verbatim, we recognize that complete validation of textual fidelity across the full corpus remains a limitation of this study. The OpenAI API used in routing was strictly operated as a frozen, non-generative classifier intended only to disambiguate schema boundaries (e.g., findings vs. rationale) without rephrasing. Nevertheless, because this process still depends on LLM-based routing, there remains a possibility of minimal abstraction or paraphrasing that was not exhaustively ruled out. To partially assess fidelity, a subset of structured case reports underwent qualitative review by ophthalmologists. However, due to the scale of data, a systematic corpus-wide audit could not be performed, and we consider this an inherent methodological limitation of the present work.

Limitations

Our evaluation uses “ground truth” labels produced by a single frozen code-mapping agent (Perplexity), applied consistently to both references and predictions; we use a fixed prompt template and deterministic decoding to minimize variance. Accordingly, all reported scores quantify alignment with a consistent automated coding policy rather than clinician-adjudicated correctness. While this design improves reproducibility, reliance on a single policy introduces mapping bias, and-despite contamination precautions-data leakage from large public corpora cannot be fully excluded. Qualitative review indicates that some apparent errors stem from structural limitations of coding taxonomies (e.g., overlapping ICD-10 hierarchies; divergent ATC leaf mappings for semantically equivalent therapies), so hierarchical-F1 and sentence-similarity metrics may understate clinically acceptable near-misses. In addition, absolute gains are modest in several categories, and generalization remains limited beyond the evaluated ophthalmic tasks and datasets. We therefore interpret low full-code accuracies as the combined effect of model errors and taxonomy constraints, and we frame this work as a reproducible policy-alignment benchmark, not as evidence of clinical accuracy or deployment readiness. Appendix C reports sensitivity analyses with independent mappers and partial human coding. Future work will include multi-rater clinician adjudication on stratified samples (blinded review, consensus procedures, inter-rater agreement such as \(\kappa\)), prospective small-scale evaluations on external notes, and policy-sensitivity analyses across alternative coders and prompting policies.

CliBench utility and generalizability

Applying CliBench to four core tasks-primary/secondary diagnosis, medication, and surgical procedure prediction-revealed fine-grained strengths (e.g., higher surgical procedure accuracy and improved full-code matching) while also exposing areas for improvement. Its hierarchical structure (L1–L4, Full) supports nuanced assessment beyond surface-level correctness and better reflects clinical reasoning fidelity. CliBench also supported comparisons across model sizes and training paradigms, illustrating how Ophtimus-V2-Tx generalizes reasonably on inference-heavy tasks such as PubMedQA despite being tuned for treatment planning; however, these results should be viewed as exploratory rather than uniformly superior. To control mapping variance, we used a single frozen coding agent with an identical prompt for both references and predictions, ensuring a consistent policy across systems. We do not rely on GPT-4o for mapping. As a limitation, we did not include a clinician coding baseline in this submission; Appendix C reports sensitivity analyses (independent mapper and partial human coding), and future work will add multi-rater clinician annotations with inter-rater agreement (e.g., \(\kappa\)) and near-code error typology.

Deployment considerations and intended use

For deployment, the model is intended for decision support rather than autonomous coding. We recommend precision-first operating points with selective abstention, mandatory human review for low-confidence or high-impact codes, and site-specific calibration (reliability assessment/temperature scaling) prior to go-live. A small pre-deployment audit and post-deployment drift monitoring provide additional safeguards for safe integration into coding workflows.

Related works

Expansion of general-purpose and medical-specific LLMs

The recent emergence of various general-purpose Large Language Models (LLMs) has led to remarkable advancements in the field of natural language processing (NLP), demonstrating impressive performance across key tasks such as question answering, summarization, and code generation27,28,29,30. Building on these successes, the application of general-purpose LLMs has rapidly expanded into the medical domain, leading to growing interest in clinical language processing31.

In parallel, a variety of medical-specific LLMs have been developed to address the unique demands of clinical settings. These models have shown significant performance in medical question answering, chest X-ray report generation, and standardized medical examinations such as the USMLE32,33,34,35,36,37,38,39. A comprehensive review of the development trends of medical LLMs can be found in Ref. 40.

While these general-purpose and medical-specific LLMs have demonstrated remarkable capabilities in broad medical NLP tasks, they often lack specialization in underrepresented yet clinically important domains such as ophthalmology. In contrast, our work focuses on the development of a lightweight, domain-specific model tailored for ophthalmology, addressing the gap in clinical reasoning performance for this specialty.

Ophthalmology-specific LLMs

In highly specialized medical fields such as ophthalmology, general-purpose LLMs often struggle to support domain-specific diagnostic reasoning or clinical decision making. In fact, degraded performance has been observed on ophthalmology-centered evaluations such as the OKAP, USMLE, and ophthalmology board exams41,42,43. To address these limitations, specialized models such as EyecareGPT44, LEME45, and RETFound46 have been developed for tasks including diagnostic generation, image interpretation, and analysis of long-form clinical case reports.

Unlike previous ophthalmology models that focus primarily on image interpretation or long-form summarization, Ophtimus-V2-Tx is designed to support structured diagnostic reasoning using real-world case reports. It generates multi-dimensional clinical outputs-including primary and secondary diagnoses, medications, and surgical procedures-and is evaluated through code-based frameworks, enabling fine-grained and semantically grounded performance assessment.

LLM-based clinical decision support

While LLMs have achieved strong performance in medical question answering and clinical reasoning tasks47,48, they still face limitations in effectively supporting the end-to-end, structured clinical decision-making process in real-world healthcare environments. To overcome this, instruction-tuned models tailored for the medical domain have been proposed. However, most are optimized for benchmark datasets such as MedQA, PubMedQA, and USMLE-style exams rather than real clinical workflows49,50,51.

Recently, various benchmark studies have been conducted to evaluate the architectures, fine-tuning strategies, and clinical applicability of such models, providing a foundation for their integration into real clinical decision support systems24,52,53,54,55. Comprehensive reviews of these clinical LLMs are available in Refs. 56 and 57.

While many instruction-tuned LLMs are optimized for benchmark datasets and question-answering formats, they often fall short of modeling the structured and sequential nature of real clinical decision-making. Our approach integrates instruction-tuning with a structured evaluation methodology based on the CliBench framework, enabling assessment of end-to-end reasoning through standardized coding systems (ICD-10-CM, ATC, ICD-10-PCS) that align with actual clinical workflows.

Practical value of compact LLM

Given the substantial computational and memory requirements of large language models, there has been increasing attention toward more lightweight alternatives known as compact LLMs. These models leverage techniques such as knowledge distillation, pruning, model compression, and quantization to achieve high performance with significantly reduced resource demands. Compact, domain-specialized LLMs are particularly valuable in resource-constrained environments, such as mobile devices, edge computing platforms, and on-premise hospital servers.

Notable examples of compact LLMs include DistilBERT58, TinyBERT59, MobileBERT60, and TinyLlama61, which are actively used in clinical chatbot systems and medical information retrieval assistants. A detailed survey of these small models and optimization techniques can be found in Ref. 62, which systematically reviews recent advances in model compression strategies and evaluates their performance across various clinical NLP tasks.

Compared to existing general-purpose and domain-specific LLMs, our study introduces a novel ophthalmology-focused small language model that bridges performance and practicality. By leveraging real-world clinical narratives, multi-faceted output generation, and semantically grounded evaluation through standardized medical coding systems, Ophtimus-V2-Tx presents a scalable and clinically relevant alternative for intelligent decision support in specialty medicine.

Conclusion

Summary of findings

We present Ophtimus-V2-Tx, a small, domain-specific language model (SLM) for ophthalmology, tuned on schema-structured case reports and evaluated with automated diagnostic reasoning and standardized benchmarking tasks (ICD-10-CM/ATC/ICD-10-PCS; L1–L4 and Full) alongside narrative similarity metrics.

This study is positioned as a benchmarking comparison between a compact, ophthalmology-specific model (Ophtimus-V2-Tx) and large general-purpose LLMs, designed to assess how domain-specific fine-tuning and structured supervision influence diagnostic reasoning and representational alignment in ophthalmic narratives. The purpose of this work is to derive methodological insights into effective development choices for domain-specialized medical LLMs, not to claim clinical deployment readiness. The code-mapping tasks (ICD/ATC/PCS) are employed solely as standardized, quantifiable benchmarks for evaluating model reasoning, rather than as evidence of autonomous code-generation capability.

On our benchmarks, Ophtimus-V2-Tx achieves competitive performance on selected structured tasks, with fine-grained strengths at exact medication/procedure codes and certain hierarchical levels, while absolute gains are modest in several categories and statistical significance is not uniform.

In the context of prior ophthalmology-focused LLMs (Ophtha-LLaMA2, EyeGPT, LEME), which emphasize MCQA or text QA and generally lack code-based evaluation, our case-based tuning and code-aligned assessment help bridge educational QA benchmarks and real-world clinical reasoning within a reproducible benchmarking framework that translates narrative reasoning into standardized, comparable metrics.

All reported metrics reflect alignment with a frozen automated coding policy rather than clinician-adjudicated correctness. Accordingly, the evaluation framework serves as a policy-alignment benchmarking environment to quantify reasoning alignment, not to infer clinical accuracy or deployment feasibility. To enhance reproducibility, the coding agent and prompts were version-locked during evaluation (see Methods). Future work will incorporate structured clinician review of model outputs to anchor automated results to clinically meaningful standards, alongside sensitivity analyses across alternative coding agents and prompting policies to quantify policy dependence, complementing multi-rater adjudication and prospective validation in subsequent iterations.

Future directions

Methodologically, we will expand clinician-adjudicated reference subsets and conduct structured, blinded multi-rater evaluations (with predefined consensus procedures and inter-rater agreement reporting, e.g., \(\kappa\)) to validate automated mapping outputs, characterize near-miss typologies, and ground benchmarking metrics in clinically interpretable standards. In parallel, targeted clinician review of representative model outputs will be introduced to qualitatively assess reasoning fidelity and contextual correctness, providing a bridge between automated evaluation and clinical relevance.

Beyond model evaluation, we will implement a structured program of data-fidelity verification to address current limitations in schema-level validation. Specifically, we will run a statistically powered, stratified sampling audit (stratified by diagnosis/procedure/medication and by document length/source) to quantify token- and field-level fidelity using predefined rubrics: exact-match rates for named tests, laterality agreement (OD/OS/OU rules), and numeric consistency with absolute/relative tolerances and unit checks, along with a structured error taxonomy (omission, substitution/paraphrase, numeric transformation, laterality inversion). We plan to report inter-rater agreement (e.g., \(\kappa\) or Gwet’s AC1) and, where applicable, 95% confidence intervals for the clinician-reviewed subset. In parallel, reinforcement-learning–based preference alignment (RLHF and related human-feedback methods) will be used to automatically detect and correct deviations from verbatim schema preservation. Collectively, these procedures will operate as a continuous data-governance loop, improving fidelity, quality control, and reproducibility across subsequent Ophtimus iterations.

We will also adopt precision-first operating points with selective abstention, site-specific calibration, and post-deployment drift monitoring to ensure safe experimental use in decision-support and research benchmarking contexts, rather than autonomous clinical coding.

Technically, we will explore multimodal inputs (e.g., OCT and fundus imaging) and agentic or ensemble architectures to enhance reasoning robustness and domain specificity, while extending evaluation to additional ophthalmic subspecialties and expanded QA datasets for broader benchmarking.

Continued mapper sensitivity analyses (including cross-mapper checks) will be maintained as code systems and data distributions evolve.