Introduction

Radiology and nuclear medicine diagnostic reports are still dictated as free text, yet clinical trials require structured, reproducible data to be extracted from them. Before the advent of large language models (LLMs), natural language processing (NLP) struggled with ambiguous language and unfamiliar terminology, whereas LLMs such as the Generative Pre-trained Transformer-4 (GPT-4) now rival human performance on many language tasks.

However, three hard constraints block the use of LLMs in healthcare: determinism (answers must not shift with prompt phrasing), traceability (reasoning must be auditable), and confidentiality (protected health data must never leak)1,2,3,4,5. Achieving all three remains an open challenge and constitutes a major bottleneck for integrating LLMs into clinical workflows and trials. If it were solved, LLMs could classify clinical reports, match patients to trials, and mine unstructured research data3,6—capabilities that would hasten the detection of pandemics7, rare side-effects, or malpractice patterns8. We therefore set out to address the pressing problem of how to harness LLMs without sacrificing the above-mentioned healthcare requirements.

LLMs are vast neural networks trained on web-scale general corpora9 and professional domains, including biomedical literature10. Belonging to the branch of stochastic AI, LLMs excel at digesting unstructured text, coping with ambiguous phrasing8,11, and—because they output probabilities—reasoning under missing data or uncertainty12. Their learned representations transfer readily between tasks, so a single model can pivot from trial matching to guideline summarisation without bespoke re-engineering, unlike the narrow, task-specific programs that characterised classical expert systems, i.e. symbolic AI.

Yet the same design brings healthcare-critical drawbacks13. Internal synaptic weights are difficult to explain, so the model’s logic is opaque to auditors14. Pattern matching without formal deduction limits multi-step reasoning15 and leaves the system blind to out-of-distribution inputs and to its own nescience16. Outputs are stochastic and prompt-sensitive: minor re-phrasings of a semantically identical prompt can lead to divergent answers17. This may be a feature for creative writing but is a liability when identical clinical facts must yield identical conclusions18. Finally, the distributed services of the most capable LLMs, such as GPT-4, raise confidentiality and alignment concerns5,19.

Compared to neural AI, the older symbolic AI stores knowledge as human-readable symbols and rules in an ontology and uses formal logic to yield deterministic, reproducible outcomes20. This makes inference chains auditable and allows the program to declare ‘unknown’, request specific missing data, or trigger another meaningful fall-back when a rule cannot fire21. Early medical expert systems like MYCIN proved this approach in the 1970s22. However, symbolic AI is labour-intensive to scale and struggles with unstructured or uncertain inputs and with rapidly evolving medical domains23,24.

We therefore propose a unified semantic-neuro-symbolic NLP pipeline. The rationale for this approach is that each component offsets the weaknesses of the others: In our design, GPT-4 harvests clinical facts from free text25, while a locally hosted expert system (Plato-3) verifies these extractions against medical rules, generating deterministic, trustworthy labels21,26. Finally, conventional software provides the practical access needed for real-world clinical workflows. In combination, these elements create a system capable of transforming unstructured diagnostic reports into structured, auditable, and privacy-preserving data suitable for research and patient care.

Recent gains in hardware throughput and programming techniques23 make real-time integration of large neural models and symbolic reasoning systems finally practicable27. Interoperability between different AI systems, as well as with the software required for a meaningful healthcare workflow, remains hard: symbolic AI, neural AI, and conventional software all reason at different abstraction levels and in different data formats28,29. To bridge these different representations of the same problem we built RUDS (Rule-based Unification of Digital technology using Semantics). This platform provides ‘loose coupling’30 but high cohesion31 through semantic message passing between diverse components—also known as semantic unification32. RUDS implements multiple programming paradigms, which allows diverse components to exchange and interpret information despite differences in data structure, programming style, or level of abstraction33, while the embedded expert system keeps track of context and meta-information at all times. By elevating all connected components to a shared semantic representation32, RUDS unites neural cognition with symbolic reasoning in a full semantic-neuro-symbolic AI stack26, realizing the cognitive computing paradigm34. The result pairs LLM-based NLP with an auditable inference chain from the expert system35,36, enabling back-tracing from each final label to the originating LLM tokens or human prompt37. This capability is crucial in healthcare, where understanding the ‘why’ behind AI decisions is essential38,39.

Our goal was an exploratory proof of concept demonstrating seamless cooperation between an LLM and a symbolic expert system in autonomously compiling structured clinical data from free-text diagnostic reports. Specifically, our study makes four practical contributions: First, we integrate GPT-4 with the expert system Plato-3 so that extracted facts are validated by medical rules. Second, each AI-generated label includes the complete symbolic reasoning chain and the supporting GPT-4 evidence, providing explainability-by-design in natural language. Third, we show that the system does not require retraining of the language model; domain knowledge is provided entirely through the rule base. Fourth, the architecture is implemented on the semantic-unification platform RUDS, enabling interoperability with conventional software and realizing the vision of a practical neuro-symbolic clinical AI3,26,40,41. This work also explains the paradigm-unifying architecture of RUDS in detail and why it is needed here. Although our evaluation was modelled after a prior PET/CT clinical study42, it was not designed to produce new clinical findings. Instead, we show in this proof-of-concept study how the combined neuro-symbolic AI accurately extracts and structures 26 clinical parameters from 206 original, unedited [68Ga]Ga-PSMA-11 reports for recurrent prostate cancer (rPC). The system matches physician performance, outperforms GPT-4 alone, produces deterministic results without hallucinations, and prevents privacy breaches by controlling all data transfer. Taken together, this work demonstrates a practical implementation of the autonomous, context-aware AI originally envisioned by the Japanese fifth-generation computing initiative43.

Methods

Patient data

The study retrospectively analysed 206 diagnostic reports from 206 consecutive patients who had undergone [68Ga]Ga-PSMA-11 PET/CT scans over eight months between January and August 2018 at the Department of Nuclear Medicine, Inselspital Bern, Switzerland, adhering to Swiss ethical guidelines44. The Cantonal Ethics Committee Bern (Kantonale Ethikkommission Bern, Murtenstrasse 31, 3010 Bern, Switzerland) approved the retrospective use of the patient reports (KEK-Nr. 2018–00299). All patients whose data are published in this manuscript signed a written informed consent form for the anonymized evaluation and publication of their data. No additional approval beyond that was obtained.

We chose [68Ga]Ga-PSMA-11 PET/CT reports because they pair a highly variable free-text narrative—including the patient history—with a compact, guideline-defined decision scheme, while the cohort size remained amenable to iterative human cross-checking. This combination creates a tractable but non-trivial testbed: if the neuro-symbolic pipeline can deliver deterministic answers here, it is well-poised to scale to larger clinical trials that share the same ‘unstructured-text + rule set’ pattern. Because the retrospective evaluation of PET/CT reports followed the design of a study published by Afshar-Oromieh et al.42, we were able to demonstrate real-world applicability, and the experience gained from this study also qualified the authors to develop the expert system’s ontology.

The diagnostic reports, originally written and checked by three nuclear medicine physicians unrelated to this study, were formatted into PDF files according to our institutional standards, codified and anonymized using the batch-processing software PDF Replacer v.1.8.7.0 (pdfreplacer.com), and split evenly into a development set and a validation set. All reports were then checked manually again for correct anonymization. We fabricated two wrongly anonymized sets using an author’s name and birthdate to test the expert system’s ability to recognize un-anonymized reports before sending information to the LLM. The reports included 160 patients with rPC, 11 patients who had undergone primary tumour staging (PTS), and 29 patients showing no cancer pathology. Two nuclear medicine physicians (C.M., A.A.O.) consensually extracted 26 study-relevant parameters (Table 1) from the 206 reports, providing a physician-generated reference with 5356 data points, i.e. labels. Where a label could not be elicited, both AI systems and the physicians were instructed to record ‘N/A’ (not applicable). Inclusion and exclusion criteria were as previously published42, meaning that rPC, PTS, and non-pathological reports needed to be distinguished.

Table 1 The 26 study parameters the AI system was tasked to extract from diagnostic PET/CT reports

Paradigm-integrating platform - theory and implementation

We combined an LLM, an expert system, and conventional software into a semantic neuro-symbolic AI system running on our study software. Table 2 summarises how each system component compensates for the others’ limitations in this setup.

Table 2 Strengths of one system component balance weaknesses of other system components in a semantic-neuro-symbolic AI setup

The study software was developed and operated on RUDS (Zentit GmbH, Muri bei Bern, Switzerland; ruds.ch), written in Java™ 8 (Oracle, Austin, TX, USA) using the NetBeans™ IDE 8.2 (Apache Software Foundation, Wilmington, DE, USA) on a Dell Precision 5470 laptop (Dell Inc., Round Rock, TX, USA) running Microsoft Windows™ 10. GPT-4 integration required internet access and an API account from OpenAI Global, LLC (San Francisco, CA, USA). The temperature setting of GPT-4 was left at its default value of one when running the analysis in May 2024.

RUDS’s software architecture enables seamless interaction between diverse computational paradigms, thereby removing dependencies in technology, instruction semantics, data, and process order. It integrates multiple programming paradigms representing various levels of abstraction (Fig. 1a), enabling the software to address the entirety of a complex problem33. The multiple levels of abstraction28 reflect real-life systems, processes, or workflows (Fig. 1b) as needed. Process abstraction is the act of converting a process description into a more abstract form, reducing the number of model components, their interactions, and behavioural complexity. This allows for a higher-level representation that captures core ideas or functionality29. Thus, the choice of programming paradigm influences how software processes data and determines, together with the addressable abstraction level, the addressable complexity and context of a problem45. Abstraction programming bridges complexity gaps between system components45 by ‘gluing meanings of parts of a discourse into a coherent whole’32 and retains context across abstraction levels46. Merely applying computational methods without considering the contextual meaning11 of data will lead to meaningless results47, which can endanger patient safety and proper study outcomes.

Fig. 1: Software architecture and workflow.

a Paradigm-integrating software architecture showing the programming paradigms (stepped pyramid) used at the respective abstraction level and the software applications (Apps, coloured boxes) used in the workflow. Paradigm integration enables the interfacing of different AI and software applications for specific workflows into a single system. The universal semantic messaging (dashed arrow) that ensures information flow across abstraction levels and the blackboard (black bar) enable generic information exchange between connected components. b Workflow diagram for analysing a single diagnostic PET/CT report. White boxes depict stations in the workflow, where the applications (grey bars) shown in a performed their specific tasks. The black arrows show the flow of data and information between the applications, Plato-3 (blue bar), and GPT-4 (pink bar). The workflow looped once for each individual report. GPT-4 generative pre-trained transformer 4, PDF portable document format, NLP natural language processing.

LLMs usually require specific APIs to couple with non-semantic information technology. Our top semantic abstraction level, with its bidirectional semantic information flow, enables high-cohesion loose coupling30,31 of semantic and non-semantic information technologies and serves at the same time to inform the human user about the system’s processes. It also enables ‘interface reversal’, where LLMs and expert systems can either actively engage with and utilize any connected application modules, or the application modules can themselves access AI capabilities. For instance, conventional software can leverage AI for advanced tasks, such as data interpretation or decision support, while AI can call semantically upon these tools for operations like data formatting or numerical processing.

To achieve this, all application modules of the study software communicate semantically through the paradigm-integrating platform. For this, the platform uses a universal messaging protocol, recursive universal objects, and a central ‘blackboard’48 acting as a shared information space and common ground49. The seamless semantic communication across paradigms (Fig. 1a) enables, for example, numerical procedures working at a low level of abstraction to ask for support from higher levels of abstraction, e.g. the expert system, when making decisions about details of their own use in a self-referencing procedure50. Algorithms at the procedural abstraction level can therefore adapt at runtime to a given context or to the intent of a computation that can only be inferred at a higher level of abstraction.
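As a minimal, self-contained illustration of this blackboard principle, written in Prolog (the language of Plato-3), components could share facts with provenance as sketched below; the predicates post/2 and read_fact/3 and the meta-information fields are our didactic assumptions, not RUDS’s actual messaging protocol.

% Minimal blackboard sketch: components post facts together with
% meta-information (source, timestamp) to a shared store; any other
% component can read them back with full provenance.
:- dynamic board/3.                        % board(Fact, Source, Time)

post(Fact, Source) :-                      % write to the blackboard
    get_time(Now),
    assertz(board(Fact, Source, Now)).

read_fact(Fact, Source, Time) :-           % read back, with provenance
    board(Fact, Source, Time).

% Example: an LLM bridge posts an extracted value; the expert system
% later retrieves it together with its origin (values illustrative).
demo :-
    post(psa_level(patient15, 4.2), gpt4_bridge),
    read_fact(psa_level(P, V), Source, _),
    format("~w: PSA = ~w ng/ml (source: ~w)~n", [P, V, Source]).

Because every fact carries its source, reasoning and data origin remain queryable by all connected components, regardless of their native abstraction level.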

Vice versa, processes at a higher level of abstraction can easily access functionality of their choice from lower levels to obtain information they need or to process data as required for reaching a given goal. Consistently, the platform uses the same design patterns for cross-paradigm access to a graphical user interface or even to APIs of third-party applications such as GPT-4. Information coming from or being sent to other digital applications and databases is always first handled at its native abstraction level and then made accessible to applications working at lower or higher levels.

The application modules handle tasks such as text extraction, data pre-processing, and report generation (Supplementary Table S1). All modules exchange information along with meta-information51, such as AI reasoning and data origin. Unlike in most conventional software, information and associated meta-information persist and remain accessible to all subsystems at all abstraction levels, even when the subsystems themselves lack semantic capabilities. Finally, the ‘agent’ software design pattern52 and the data, context, and interaction paradigm53 bestow agency onto parts of the system. Supplementary Table S1 explains the various concepts of the software.

Expert system implementation

The NLP expert system Plato-3, written in Prolog21, is an integral part of RUDS. Plato-3 communicates semantically with GPT-4 and the other required application modules, and its rule base realizes the ontology software paradigm54. Plato-3 speaks a distinct German subset of natural language for oncologic PET reports, using the definite clause grammar formalism55. This recursively enumerable language spans the full Chomsky hierarchy of formal grammar classes56. Consequently, Plato-3 accepts unknown words from GPT-4 but provides a defined language set to GPT-4 for unambiguous communication between the two different AI concepts. This semantic unification32 ensures that the two AI systems, despite operating with different internal representations and reasoning methods, share a common semantic framework for consistent and unambiguous communication, thereby realizing the neuro-symbolic paradigm (Fig. 1a).
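To make the formalism concrete, a toy definite clause grammar in the style Plato-3 might use could look as follows; the German vocabulary and the parse structure are illustrative only, not Plato-3’s actual grammar.

% Toy definite clause grammar (DCG) for one German report phrase.
finding(pathology(Site, Status)) --> site(Site), status(Status).

site(prostate_bed) --> [prostataloge].               % "Prostataloge"
site(lymph_node)   --> [lymphknoten].                % "Lymphknoten"

status(suspicious)   --> [mit], [mehranreicherung].  % with tracer uptake
status(unsuspicious) --> [ohne], [mehranreicherung]. % without tracer uptake

% ?- phrase(finding(F), [prostataloge, mit, mehranreicherung]).
% F = pathology(prostate_bed, suspicious).

Because DCG rules are ordinary Prolog clauses, parsed structures feed directly into the same rule base that performs the logical inference.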

Plato-3’s ontology contains facts along with their logical relations, rules, and problem-solving methods20. In logic, a fact is a statement that is unequivocally true within a given domain, such as ‘Patient X’s report mentioned a relapse’. A rule is a conditional statement that connects facts and enables reasoning or inference, such as ‘If a relapse is mentioned, then the PET/CT report is classified as no PTS’. These constructs allow Plato-3 to use recursive first-order predicate logic for reasoning and for deriving new facts and knowledge from already existing facts47.
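Written as Prolog clauses, the two examples above would read as follows (the predicate names are our illustration):

mentions_relapse(patient_x).          % fact: the report mentions a relapse

classification(Report, no_pts) :-     % rule: a mentioned relapse rules out PTS
    mentions_relapse(Report).

% ?- classification(patient_x, C).
% C = no_pts.

Resolution over such clauses is deterministic, which is what makes every derived label reproducible and its inference chain auditable.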

Here, the ontology encompasses clinical guidelines for anonymization, study inclusion (Fig. 2), identifying pathological reports, and determining correct prostate-specific antigen (PSA) values. The ontology can integrate new facts and rules from various modules, expressed in natural or symbolic language, enabling self-modification50.
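A minimal sketch of such runtime ontology growth, using Prolog’s standard assertz/1 (the clause contents are illustrative):

:- dynamic classification/2.
:- dynamic mentions_primary_staging/1.

% Add a new rule to the ontology while the system is running.
learn(Head, Body) :-
    assertz((Head :- Body)).

% Example: asserting the rule "primary staging mentioned => PTS":
% ?- learn(classification(R, pts), mentions_primary_staging(R)).

Rules asserted this way fire immediately in subsequent queries, which is the mechanism underlying the self-modification cited above.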

Fig. 2: Simplified excerpt of the Plato-3 ontology used to detect PTS reports.

Start 1 yields an initial set of meta-facts, which are re-checked after Start 2. Rhomboids depict fact checking within a rule. Light-green/red nodes mark intermediate reasoning states that create additional facts; dark-green/red nodes mark final verdicts. Fall-back rules are omitted for clarity but would result in “PTS status could not be established” with an accordant reasoning (such as “data missing”). Example – Rule “No PTS 1”: If the patient has undergone radical prostatectomy and no rPC was initially diagnosed in the 80 days before the PET examination, primary tumour staging is ruled out. In contrast, Rule “No PTS 5” reads: If rPC was initially diagnosed in the 80 days before the PET examination, PTS is still possible. The final confirmation would then be made by rule PTS7. ADT androgen deprivation therapy, PET positron emission tomography, PSA prostate-specific antigen, rPC recurrent prostate cancer, RPE radical prostatectomy, TNM classification of malignant tumours with tumour size (T), lymph nodes (N), and distant metastases (M); the c-modifier indicates that the tumour stage was determined before treatment.
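As a hedged sketch, rule “No PTS 1” from this caption could be expressed in Prolog roughly as follows; the fact names and the representation of dates as absolute day numbers are our assumptions, not Plato-3’s internal terms.

% Rule "No PTS 1": radical prostatectomy plus no initial rPC diagnosis
% within the 80 days before the PET scan rules out primary tumour staging.
:- dynamic initial_rpc_diagnosis/2.    % may be absent for a patient

had_rpe(patient15).                    % illustrative sample facts
pet_date(patient15, 737000).           % dates as absolute day numbers

no_pts(Patient) :-
    had_rpe(Patient),                  % radical prostatectomy performed
    pet_date(Patient, PetDay),
    \+ ( initial_rpc_diagnosis(Patient, DxDay),
         Diff is PetDay - DxDay,
         Diff >= 0, Diff =< 80 ).      % no diagnosis inside the window

% ?- no_pts(patient15).
% true.

The negation-as-failure operator (\+) mirrors the ‘no rPC was diagnosed’ condition; when the prerequisite facts themselves are missing, the rule simply fails and a fall-back rule reports the undetermined status instead.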

Study aims and workflow

We tasked the system with extracting 26 study parameters (Table 1) from the manually pre-anonymized and codified 206 PET reports and with answering three main study questions relevant for patient inclusion and the aims of the reference study42. The first study question concerned study inclusion: Does the properly anonymized report describe rPC after radical prostatectomy or PTS (Parameter 9)? The second study question concerned identifying reports mentioning pathology: Was a pathology found by the PSMA PET/CT, was it rPC, and how many tumour locations were found (Parameter 10)? The third study question concerned PSA levels mentioned in the reports: What was the PSA level measured closest before the PET/CT scan, and what was the time interval between the PSA measurement and the PET scan (Parameters 5 and 6)?

The answers to these three main study questions could not simply be parsed from the reports by the AI but required inference from other study parameters. To this end, Plato-3 inferred study parameters 1 to 10 from logical relations between already detected study parameters, while extracting parameters 11 to 26 primarily involved parsing and structuring text by GPT-4 and did not require expert system inference (Table 1). This procedure reflects the division of labour between the two different AI types.

Before sending data to GPT-4, the workflow (Fig. 1b) included the TextSplitApp application module segmenting every report into the clinical history with the clinical question, the clinical findings, and the physicians’ conclusions, and Plato-3 verifying anonymization. This ensured that the resulting fragments passed to GPT-4 contained no trackable or sensitive information. The software then used pre-engineered prompts to guide GPT-4 in data extraction. We refined these pre-engineered prompts as well as Plato-3’s ontology (Fig. 2) through iterative testing on the development dataset. Most prompts asked GPT-4 to provide a reasoning statement at the end of its answer. Plato-3 saved GPT-4’s answers together with their reasoning statements in its database, converted them into semantic facts, and checked these facts against its rules to create new facts. When it found controversial facts, Plato-3 re-consulted GPT-4 with its own reasoning. GPT-4’s initial reasoning thereby served as the starting point for Plato-3’s chain of thought. Although Plato-3 could not verify GPT-4’s initial reasoning, it could rule out controversial facts and therefore accepted only reasoning derived from plausible fact statements. All in all, the expert system takes on the role of a quality-control reviewer, verifying inputs to and outputs from the LLM, checking data against clinical rules and study guidelines, and flagging any inconsistencies or rule breaches with clear reasoning41. Supplementary Table S3 provides a detailed description of how the software’s core components direct the workflow.
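The controversy-handling step can be sketched in Prolog as follows; this is a didactic simplification, and the predicate names and messages are ours.

% Quality-control loop: GPT-4's answer is stored as a fact, checked
% against what the rule base derives, and a re-prompt is issued when
% the expert-system inference contradicts it.
:- dynamic llm_fact/2.
:- dynamic derived_label/2.

derived_label(report15, no_pts).       % e.g. derived by rules as in Fig. 2

process_answer(Report, Label, Reasoning) :-
    assertz(llm_fact(Report, Label)),          % store GPT-4's claim
    derived_label(Report, Derived),            % what the rules conclude
    (   Derived == Label
    ->  accept(Report, Label, Reasoning)       % facts are consistent
    ;   reprompt_llm(Report, Derived)          % controversy: re-consult
    ).

accept(Report, Label, Reasoning) :-
    format("accepted ~w = ~w (~w)~n", [Report, Label, Reasoning]).

reprompt_llm(Report, Expected) :-
    format("re-prompting GPT-4: rules derive ~w for ~w~n", [Expected, Report]).

% ?- process_answer(report15, pts, 'trigger word: staging').
% re-prompting GPT-4: rules derive no_pts for report15

In the actual workflow, the accepted fact is stored together with the complete rule trace, so that each final label can be traced back to the GPT-4 evidence it rests on (cf. Fig. 3).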

Statistics and reproducibility

First, we compared the output of the neuro-symbolic AI, i.e. GPT-4 combined with Plato-3, to the physician-generated reference established by two physicians and one study nurse. Discrepancies between outputs were manually reassessed by authors A.A.O. and G.A.P. in the respective PET reports, using the AI’s reasoning chain to identify the underlying cause of the conflict. This human-driven reassessment generated a new, reviewed reference, which served as the new ground truth for comparing outputs from GPT-4 alone (GPT-4-only), the neuro-symbolic AI, and the original physician-generated reference.

To avoid circularity in preparing the ground truth, the adjudication pipeline followed three steps. (1) Dual encoding of every report: an original label was supplied by the reporting physician, while two AI labels were produced independently by the neural (GPT-4-only) and neuro-symbolic pipelines. (2) Blinded adjudication: all discordant items were re-read and re-labelled by a nuclear medicine physician (A.A.O.) who was blinded to which label came from the AI. (3) Logic concordance check: the adjudicator’s decision and the AI’s explicit rule trace were then reviewed by a second author (G.A.P.). A correction was accepted only if the adjudicator’s reasoning and the AI’s rule-based trace were fully congruent; otherwise, the original human label was retained.

Each of the 206 PET/CT reports represented one independent sample. For parameter extraction, every report contributed one replicate per parameter, yielding 5356 data points in total. Replicates are therefore defined as individual parameter extractions from independent reports. Each report was analysed once by GPT-4, once by physicians, and once by the combined neuro-symbolic AI. No technical replicates or repeated runs were used.

Sensitivity, specificity, and predictive performance (F-score) were evaluated for discriminating pathological cases and identifying PTS patients in the three comparisons. McNemar’s test with exact binomial testing was used to calculate two-sided p-values, with p ≤ 0.05 considered statistically significant. Agreement with the reviewed reference in identifying the remaining study parameters was reported as a percentage, with Pearson’s chi-squared multinomial test applied to parameters 2 to 10, which were shared between GPT-4-only, the neuro-symbolic AI, and the physician-generated reference (Table 1). Bonferroni correction was applied to account for multiple comparisons.
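For completeness, the standard definitions used here can be stated in LaTeX notation as

\mathrm{sensitivity} = \frac{TP}{TP+FN}, \qquad
\mathrm{specificity} = \frac{TN}{TN+FP}, \qquad
F_1 = \frac{2\,TP}{2\,TP + FP + FN},

and McNemar’s exact two-sided p-value, with b and c the discordant classifications between two raters and n = b + c, is

p = \min\!\left(1,\; 2 \sum_{k=0}^{\min(b,c)} \binom{n}{k} \left(\tfrac{1}{2}\right)^{n}\right).

As a worked check against the Results below: GPT-4-only’s PTS detection, with five true positives, six false negatives, and no false positives, gives F_1 = 10/16 ≈ 0.63.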

Results

Reviewed reference

Plato-3 blocked the two intentionally mis-anonymized data sets: one with the author’s name and one with the birthdate in plain text. Furthermore, it flagged 17 reports missing the disclaimer text. The latter indicated possibly missing written consent, which was retrieved in all cases after manual rechecking. The subsequent manual inspection of discrepancies between the physician-generated reference and the AI results revealed a total of 82 human errors (Table 1). Furthermore, the neuro-symbolic AI’s reasoning in natural language (example given in Fig. 3) was consistently accurate, while GPT-4-only’s was not. Therefore, all 32 changes in study parameters 1 to 10 suggested by the combined neuro-symbolic AI were accepted into the reviewed reference, while only 52 out of 207 changes in parameters 11 to 26 suggested by GPT-4-only were accepted. Since no performance differences were observed between the development and validation sets, they were combined for the results. GPT-4 always agreed with Plato-3’s chain of reasoning when re-prompted (cf. Step c in Fig. 3).

Fig. 3: Traceable reasoning process in four steps shown on the example of detecting a PTS-report (Patient 15) as presented to the user.

a First step with the trace of Plato-3’s decision. b Second step with the trace in which Plato-3 identified a controversy between its decision and the decision of GPT-4. c Third step showing the trace with which Plato-3 re-prompted GPT-4 with the identified controversy and its decision. The answer from GPT-4 constitutes the fourth step. Green: Plato-3’s predicates, including the information sources or applied rules in square parentheses; [notInDB]: Plato-3 could not find the fact in its database; txt: the text contains language elements unknown to Plato-3. Blue: Plato-3’s reasoning. The originally German text was translated into English, replicating Plato-3’s distinct set of natural language. LLM large language model, PET/CT positron emission tomography combined with computed tomography, PSMA prostate-specific membrane antigen, PTS primary tumour staging, n/a not available.

Study inclusion

GPT-4-only correctly excluded two PTS patients from the development set and three from the validation set but missed three PTS patients in each set. There were no false-positive PTS categorizations by GPT-4, and it achieved a sensitivity of 0.45, a specificity of unity, and an F-score of 0.63. Under Plato-3’s supervision, the neuro-symbolic AI correctly excluded all 11 PTS patients from the cohort, bringing sensitivity, specificity, and F-score to unity. Manual inspection revealed that the neuro-symbolic AI had correctly identified one PTS patient whom the physicians had mislabelled as rPC, giving the physician-generated reference a sensitivity of 0.91, a specificity of unity, and an F-score of 0.95 (Table 3).

Table 3 Comparison of GPT-4-only, Neuro-symbolic AI, and Physician-generated reference for discerning primary tumour staging (PTS) reports and pathological reports mentioning recurrent prostate cancer (rPC)

Table 4 lists each PTS patient that GPT-4 missed. Observed failure patterns were (i) trigger-word bias, (ii) missing temporal reasoning, and (iii) hallucinated clinical context when phrasing was ambiguous. Because Plato-3 re-evaluates every structured fact against explicit guidelines and stored facts, all six misclassifications were corrected, so the combined neuro-symbolic system produced the right PTS label in every case.

Table 4 Qualitative analysis of all six PTS reports not detected by GPT-4

Pathological reports

Without Plato-3, GPT-4-only correctly classified 176 PET/CT reports as pathological and 13 as healthy but misclassified 16 cases as pathological and 1 case as healthy, resulting in a sensitivity of 0.99, a specificity of 0.45, and an F-score of 0.95. In all 16 false-positive cases, GPT-4 answered ‘no’ to specific questions about primary tumours or metastases but speculated that a pathology was present based on the clinical history. The neuro-symbolic AI made no mistakes and achieved a sensitivity, specificity, and F-score of unity. In the 16 previously false-positive cases, GPT-4 revised its opinion after receiving Plato-3’s accordant chain of reasoning. Furthermore, the neuro-symbolic AI identified one false-negative and one false-positive pathology classification in the physician-generated reference, resulting in a sensitivity of 0.99, a specificity of 0.97, and an F-score of 0.99 (Table 3).

PSA-level and other study parameters

GPT-4-only made four mistakes in the development set and none in the validation set when identifying the latest PSA level, resulting in an overall agreement with the reviewed reference of 98.1%. With Plato-3 re-prompting GPT-4, the neuro-symbolic AI improved agreement to 100%. The mistakes GPT-4 made in identifying the correct PSA level were exclusively due to erroneously formatted or written PET reports; examples include missing measurement units, non-attributable dates, and incorrect designations. These ambiguities also affected the human readers, as the neuro-symbolic AI uncovered seven PSA-level mistakes made by the physicians, giving a human agreement rate of 96.6%. The overall agreement in correctly detecting study parameters compared to the reviewed reference was 94.7 ± 7.1% for GPT-4-only and 98.4 ± 1.9% for the physician-generated reference. The agreement in correctly detecting the study parameters covered by the ontology (Parameters 2 to 9) was 98.1 ± 2.7% for GPT-4-only, 98.4 ± 1.8% for the physician-generated reference, and 100 ± 0% for the neuro-symbolic AI (Table 1). No significant differences were observed between the data sets.

Discussion

Our main result is the proof of concept for our approach to realizing neuro-symbolic AI. At autonomously structuring and analysing medical reports, the neuro-symbolic AI outperformed the unaided LLM (GPT-4) and matched or outperformed trained physicians. GPT-4 alone performed similarly to previously published results that showed over 90% success in text mining6. However, when extracting study parameters controlled by the expert system, the combination of GPT-4 and the expert system reached near-perfect accuracy. This was especially apparent for the three main study questions, where inference was required. The expert system’s oversight also ensured confidentiality, reducing the privacy risks posed by GPT-4’s distributed nature.

Our results explore how a semantic-neuro-symbolic AI can discern and communicate complex factual issues, such as identifying distinct patient groups or the facts necessary for replicating the reference study42. The correct identification of PTS constituted the most demanding task, requiring the AI to construct multiple fact layers. Checking facts against a dedicated ontology and using a specific subset of natural language removes ambiguity from the workflow and enables GPT-4 to structure PET reports into validated data without specialized training for the task. The human operator is always able to retrace the neuro-symbolic AI’s reasoning step by step for every label, as well as to understand and correct their own mistakes. Figure 3 demonstrates this traceability and shows, together with Table 4, how the expert system complemented the LLM in the decision-making process.

A key strength is that we retained every PET/CT report exactly as dictated by three different staff physicians, each using their own phrasing, abbreviations and section layout; this heterogeneity shows that the neuro-symbolic pipeline handles natural language that extends well beyond a single author-specific template. Additionally, using the publicly accessible GPT-4 and not a specifically trained foundation model makes our findings generalizable to other LLMs.

Artificial intelligence will be introduced into healthcare, and decision traceability and fact checking will become crucial due to the high importance of accountability. In particular, the correct identification of wrongly anonymized data and missing disclaimers highlights the need for prudence when using stochastic ‘black box’ technologies like LLMs. This caution arises from the potential risks to patient wellbeing and privacy posed by opaque decision-making processes and unclear data handling in unsupervised autonomous software19.

Most approaches to combating an LLM’s stochastic output focus on fine-tuning the model with specific medical data57, giving it access to external databases, or embedding symbolic knowledge directly into neural networks27,58, which limits these systems to specific, narrowly defined tasks. In contrast, our validation study uses the RUDS platform, which enables interaction between different programming paradigms and thus the generic interfacing of different AI concepts into a single neuro-symbolic system. Instead of tightly coupling symbolic AI within the neural network, RUDS allows symbolic and neural engines to prompt each other dynamically and to call upon additional AI models or traditional software as needed. This broadens the scope of problems such a combined AI can address, moving beyond task-specific applications towards an ‘artificial expert’ capable of handling a wider range of challenges.

Generic interfacing of fundamentally different AI and software types is nearly impossible without paradigm integration. Paradigm integration itself became possible because the progress required for understanding and supporting composition operations within programming environments46 has finally been made. As a result, we have developed interfaces suitable for paradigm unification and for realizing the cognitive computing34 and neuro-symbolic AI paradigms. Our approach mirrors real-world processes running at different abstraction levels28, simplifying them into manageable components while maintaining the core functionalities of its individual modules29. This contrasts with contemporary multi-paradigm software, which mostly uses programming paradigms in an isolated manner to address very specific problems.

Cognitive computing, which imparts meaning to data from context and intention, is ideal for understanding, designing, and controlling complex systems that handle heterogeneous data and exhibit unexpected behaviour34,43. While solving complex problems, such as compiling studies from unstructured medical data, can be difficult, validating a problem’s solution is generally easier. Therefore, our neuro-symbolic AI uses the LLM for problem solving but employs a rule-based expert system to control input and validate the LLM’s outputs against its human-designed ontology. Compared to simply using a knowledge base to augment an LLM57, our expert system with its ontology uses recursive predicate logic59 to verify and refine decisions60. This enables iterative reasoning, self-referencing61, and the handling of emergent properties. Furthermore, creating an ontology effectively lessens the need to train specific AI foundation models and allows the use of boilerplate LLMs.

In contrast to standalone generative AI, our neuro-symbolic AI can be allowed to self-modify62 and to acquire new knowledge from unstructured sources like medical textbooks or scientific papers while working on its tasks. We are currently exploring this capability by having GPT-4 write new rules into the ontology, while the expert system constructs new prompts for GPT-4 from them, addressing the ontology-scaling problem63.

Three caveats must be noted. First, the ontology was tailored to a small local data set, lacking rules for some parameters, such as pathology localization and lesion quantity. However, Plato-3 recognizes and communicates unknowns, handling such cases by using fall-back rules. Second, GPT-4 receives continuous updates, so the results reflect the system’s abilities at a particular time. The expert system’s ontology, however, rectifies output regardless of LLM updates. Third, mistakes undetected by the neuro-symbolic AI in the original physician-generated reference would be carried over to the new reviewed reference, i.e. the new ground-truth. Also, some circular-validation risk remains whenever an AI may outperform its human benchmark. To curb that risk we accepted a revision only when the expert system’s rule trace and the adjudicator’s reasoning independently agreed.

Currently under development is a method that gives the system direct access to a workstation through a multi-modal, vision-capable LLM, allowing it to enter data directly into trial forms without changing an existing workflow. Using an expert system to audit LLM decisions also offers LLM applications the chance to pass medical-device safety regulations, such as receiving a CE marking64. However, this does not absolve future work from including multi-centre data to test robustness across institutional reporting styles and to check against possible demographic bias introduced by the LLM.

Nevertheless, the lessons learned from this proof of principle are already applicable to large multicentre clinical trials: With our paradigm-integrating methods, language barriers, idiosyncrasies, or incompatible information technology no longer hinder evaluations of transnational datasets. Furthermore, LLMs under the transparent control of an expert system can be applied wherever humane rules and values must be respected19, such as indexing electronic patient dossiers or supervising clinical workflows5.

Conclusion

We conclude that our work offers a solution to the errors and omissions, lack of transparency, and privacy risks of generative AI. Our rule-based quality control of LLMs permits their safe use in structuring free text from radiological reports. The hyper-exponential growth of AI technology will increasingly integrate AI into human-centred tasks like health services, and future AI will likely instruct people rather than merely assist them4. Having comprehensible AI decisions will therefore be crucial. Our work prepares for this paradigm shift by incorporating controlling and auditing mechanisms into autonomous AI systems, addressing the recognized needs for transparent, fair, robust, and ethical AI40.