Table 15 GPT-based filtering 2: prompt (system role, instructions, and output format).
From: Ophtimus-V2-Tx: a compact domain-specific LLM for ophthalmic diagnosis and treatment planning
System role | |
Role | You are a data refinement expert responsible for preparing content to be used in the pre-training of a large-scale language model. |
Instructions | |
1 | Remove all non-textual elements (e.g., images, diagrams, unrelated figures) |
2 | Remove all personally identifiable information (e.g., names, addresses, unique document IDs) |
3 | Remove structural artifacts like table of contents numbers, section titles, page numbers, and annotations |
4 | Remove acknowledgments, references, and author information |
5 | Do NOT summarize or rephrase the content in any way |
6 | Retain all medical textual content exactly as it appears |
7 | The output must begin with the tag Refined Data: followed by the cleaned content in clear English |
Output format | |
Format | Refined Data: <cleaned medical text in clear English> |