Introduction

Protein–protein interactions (PPIs) are crucial in various physiological processes, such as gene expression, signal transduction, and apoptosis, which directly impact health and disease1. Aberrations in PPIs, induced by factors like mutations or infections, are strongly linked to diseases such as cancer2,3. Moreover, PPIs influence a variety of industries, as can be seen in their role in food processing, where enzymes like chymotrypsin are essential for breaking down wheat gluten proteins4, and in agriculture, where protein interactions affect fruit ripening5. The surge in biomedical literature, exemplified by PubMed’s extensive database, challenges researchers to efficiently mine and extract actionable insights. To help address this challenge, advances in Natural Language Processing (NLP), such as Named Entity Recognition (NER), Relation Extraction (RE)6, and Question Answering (QA)7, offer significant potential. The introduction of transformer-based models like BERT8 in 2018 revolutionized this field and has led to specialized versions such as BioBERT9, SciBERT10, and Clinical BERT11, enhancing task-specific model performance.

Early methodologies for extracting PPIs from the literature included pattern-based, co-occurrence, and machine learning strategies. Pattern-based techniques, which involve manually constructing rules to identify protein pairs, provide a foundational approach12,13, while co-occurrence methods leverage occurrences of protein pairs within single sentences to facilitate extraction, albeit with limitations in their capacity to capture complex relationships14. Advanced approaches like machine learning use comprehensive feature sets that combine linguistic and structural elements to enhance PPI extraction, though these methods often face challenges related to structural similarities within the data15,16. Furthermore, kernel-based techniques such as sub-tree17, subset tree18, partial tree19, spectrum tree20, and feature-enriched tree21 kernels effectively use high-dimensional sentence features to optimize the extraction process22,23; and composite kernel methods refine this approach by integrating multiple kernel types to enhance the extraction and analysis of textual information24,25,26. Additionally, the development of the gradient-tree boosting model, LpGBoost, marks a significant advance in computational efficiency through its use of shallow data representations27. Moreover, neural network architectures, particularly convolutional (CNN) and recurrent neural networks (RNN), play a pivotal role by facilitating autonomous feature extraction; this augments kernel methods and improves the overall efficacy of PPI extraction tasks28,29,30. The integration of transformer-based BERT models represents a further evolution in PPI extraction methodologies. These models incorporate deep learning techniques with lexical and syntactic processing to significantly advance the field and enable more precise and effective extraction of protein interactions31,32. However, despite the effectiveness of domain-specific models, their broader application and effective implementation are often limited by the requirement for foundational computer science knowledge. In contrast, the development of large language models (LLMs), such as OpenAI’s ChatGPT introduced in 2022, provides a promising alternative. These models, pre-trained on a vast range of datasets, are capable of generating contextually relevant responses across various domains without the need for task-specific fine-tuning33. Moreover, the LLMs in OpenAI’s Generative Pre-trained Transformer series have evolved from GPT-1 to the more robust and multimodal GPT-4, which demonstrates extensive capabilities in understanding and generating language34,35. These advanced models can seamlessly support a range of applications without requiring domain-specific adjustments36,37.

Compared to open-source language models, which require technical expertise in computational methods for training and prediction, proprietary LLMs such as ChatGPT and Gemini38 enable non-technical users to efficiently obtain domain-specific reference results through well-designed prompts. This capability can significantly accelerate research processes in fields such as pharmaceutical development39. In this study, we use the protein-protein interaction (PPI) prediction task as a case study to evaluate the performance of proprietary LLMs in responding to specifically designed prompts, to simulate the extent of assistance available to biomedical researchers. Our objective is not only to assess the intrinsic capabilities and limitations of these models in specialized tasks but also to explore the extent to which prompt optimization can enhance their performance. In summary, this analysis aims to provide a clearer perspective on the applicability of proprietary LLMs in handling complex biological data, and in so doing, offer valuable insights into the practical deployment of AI technologies in the life sciences.

Methods

Protein–protein interaction (PPI) task definition

The definition of protein-protein interaction (PPI) can vary depending on the conceptual delineation of entities required for different applications. Since it was established in the 1970s40, the interdependence of DNA, RNA, and proteins has been well recognized. However, the gene-centered representation commonly adopted in biomedical literature tends to exacerbate the ambiguity among these molecular entities. To improve the accuracy of biological interaction modeling, many studies have expanded the definition of protein entities to include genes and RNA41,42,43,44. Consequently, PPI datasets differ in their definitions of protein entities based on their intended applications. Moreover, the definition of interactions between proteins spans multiple perspectives, ranging from direct physical contact to broader contextual associations described in textual sources. To mitigate error propagation caused by sequential entity recognition and relation prediction, most PPI datasets provide pre-annotated entities, which facilitate subsequent modeling efforts. In this study, we adhere to the entity annotations provided by each dataset and conduct sentence-level predictions. During prompt development, the model is presented with a pair of named entities and the sentence containing them, and it is tasked with predicting whether the protein pair interacts. Entity pairs with annotated interactions are classified as positive instances, whereas non-interacting pairs are categorized as negative instances. If a sentence contains multiple entity pairs, each pair’s relationship is assessed separately. For example, if SigK and GerE are determined to have no interaction, the instance is labeled as negative, whereas if SigK and ykvP are identified as interacting within the same sentence, it is labeled as positive. This is illustrated in the examples provided in Supplementary Table 1.
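To make the instance construction concrete, the following minimal Python sketch enumerates all entity pairs in an annotated sentence and labels each pair against the gold annotations; the helper names (make_instances, gold_interactions) are illustrative assumptions, not part of any dataset's tooling.

```python
from itertools import combinations

def make_instances(sentence, entities, gold_interactions):
    """Enumerate every unordered entity pair in a sentence and label it.

    entities:          pre-annotated protein/gene names in the sentence
    gold_interactions: set of frozensets holding the annotated interacting pairs
    """
    instances = []
    for e1, e2 in combinations(entities, 2):
        label = "positive" if frozenset((e1, e2)) in gold_interactions else "negative"
        instances.append({"sentence": sentence, "pair": (e1, e2), "label": label})
    return instances

# Mirrors the SigK/GerE (negative) and SigK/ykvP (positive) pairs described above;
# the sentence text and gold set shown here are illustrative only.
instances = make_instances("<annotated sentence text>",
                           entities=["SigK", "GerE", "ykvP"],
                           gold_interactions={frozenset(("SigK", "ykvP"))})
```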

Evaluation dataset and experiment settings

We employed six PPI benchmark datasets to evaluate our prompting strategies: LLL, IEPA, HPRD50, AIMed, BioInfer, and PEDD. The LLL dataset, which originated from the Learning Language in Logic 2005 (LLL05) challenge and was sourced from the Medline database, focuses on extracting protein/gene interactions and comprises only 77 sentences, a limited quantity of information45. The IEPA dataset comprises 486 sentences extracted from PubMed abstracts46. The HPRD50 dataset includes 50 random abstracts from the Human Protein Reference Database (HPRD), totaling 145 sentences with proteins/genes pre-tagged by ProMiner and interaction relationships annotated by human experts13. The AIMed dataset contains 200 abstracts from PubMed, manually annotated with entities and their interaction relationships47. The BioInfer dataset comprises 1100 sentences and was collected using the Database of Interacting Proteins (DIP) to identify PubMed search inputs related to interacting entities; the selected sentences contain more than one pair of interacting entities48. We also used the recently published PEDD dataset, which was derived from the AICUP 2019 competition and focuses on PubMed abstracts published after 2015 in journals with impact factors greater than 5 to ensure higher-quality and more recent information49. The original distribution of positive and negative PPI instances across these datasets is presented in the “Raw data” column of Supplementary Table 2. It is worth noting that, since a single sentence may contain both positive and negative instances, the total sentence count is significantly lower than the sum of positive and negative instances.

Our experimental workflow followed a two-phase evaluation approach. In the first phase, we selected representative samples from the five benchmark datasets (LLL, IEPA, HPRD50, AIMed, and BioInfer) for prompt design and initial evaluation using GPT-3.5 and GPT-4. In the second phase, we conducted comprehensive performance evaluations using all six datasets (including PEDD) across multiple proprietary LLMs, including GPT-3.5, GPT-4, and Gemini 1.5 (both Flash and Pro versions). All experiments were implemented in Python using the corresponding LLM APIs. To ensure consistent evaluation, we required all LLMs to output their predictions in a standardized JSON format, which facilitated the systematic calculation of the three standard metrics we employed: recall, precision, and F1-score. Precision measures the accuracy of positive predictions by calculating the proportion of true positive cases among all instances identified as positive; this indicates the model’s ability to avoid false positives. Recall quantifies the proportion of true positives that the system correctly recognizes and thus reflects the model’s ability to capture all relevant interactions. The F1-score provides a balanced measure of the model’s overall performance by combining precision and recall into a single value. To investigate whether providing additional context could improve LLM-based PPI extraction, we experimented with multi-sentence inputs (ranging from 1 to 5 sentences). While adding contextual information might theoretically enhance the model’s ability to infer interactions, it also introduces the possibility of additional noise, potentially affecting prediction reliability. Our results showed that multi-sentence inputs led to performance fluctuations across different models and datasets, indicating that longer inputs do not consistently improve predictive accuracy. To ensure methodological consistency and reproducibility, our primary evaluation was conducted using single-sentence inputs. A detailed analysis of this comparison is provided in the Results section.
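As a concrete illustration of the evaluation step, the sketch below parses a JSON-formatted model reply and computes precision, recall, and F1 from paired binary labels. The response schema ({"interaction": "yes"}) and the helper names are assumptions made for illustration; only the metric definitions follow the text above.

```python
import json

def parse_response(raw_text):
    """Parse a JSON reply such as '{"interaction": "yes"}' into a boolean prediction."""
    reply = json.loads(raw_text)
    return str(reply.get("interaction", "no")).strip().lower() in {"yes", "true", "1"}

def precision_recall_f1(gold, predicted):
    """Standard positive-class precision, recall, and F1 over paired boolean labels."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```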

Methodological framework overview

This study presents a systematic approach to evaluate the capability of proprietary LLMs in PPI detection through carefully designed prompting strategies. Our methodology comprises three main components: (1) prompt engineering and optimization, (2) context-aware prompt selection, and (3) systematic evaluation across multiple datasets. In the prompt engineering phase, we developed six distinct prompting scenarios, progressively increasing in complexity from basic interaction queries to sophisticated entity-tagged formats. These scenarios were designed to assess how different levels of input structure and contextual information affect model performance. The prompts vary in two key dimensions: the degree of entity tagging (from untagged to comprehensively tagged with numerical identifiers) and the specificity of query statements (from simple interaction queries to structured JSON output requests). The context-aware selection mechanism enables adaptive prompt deployment based on sentence characteristics. This component specifically addresses the challenges posed by various protein entity representations in the biomedical literature, including complicated entities (e.g., "Arp2/3 complex"). The selection process ensures that the most appropriate prompt is applied to each specific context. For systematic evaluation, we implemented a comprehensive workflow that processes PPI datasets sequentially, incorporating pairwise protein labeling and context-specific prompt selection. This structured approach allows for consistent assessment across different datasets while accommodating their unique characteristics. The following sections detail these components, beginning with an in-depth examination of the prompting scenarios, followed by the advanced workflow that integrates these prompts into a coherent PPI detection system.

Prompting scenarios

To systematically evaluate PPI extraction, we designed a comprehensive prompting strategy as illustrated in Fig. 1. The strategy categorizes input data into two primary types: Basic Entities and Nested/Complex Entities. For Basic Entities, we developed three distinct entity tagging strategies: (1) No Additional Tagging, where raw text is processed directly; (2) Tag Only One Protein Pair, which focuses on specific protein pairs of interest; and (3) Tag All Protein Entities with Repeated Proteins Assigned Numbers, which provides comprehensive protein markup. These tagging strategies are coupled with progressively complex query statements: from basic interaction verification (Prompt 1), to identifying PPIs without explicit protein information (Prompts 2-3), to full protein-aware PPI identification (Prompts 4-5). For Nested/Complex Entities, we implemented a specialized Pairwise Gene Tagging approach (Prompt 6) to handle compound entities with special characters. More detailed processing methods for these complex cases will be introduced later. All tagged entities use the "<GENE></GENE>" markup schema to standardize the identification of protein entities across different prompting scenarios.
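For illustration, the sketch below implements the third tagging strategy (tag all protein entities, numbering repeated names) using the <GENE></GENE> markup described above. The numbering format shown (an index appended to repeated mentions) and the helper name are assumptions; the study's exact treatment of repeated entities may differ.

```python
import re
from collections import Counter

def tag_all_entities(sentence, entities):
    """Wrap every annotated entity mention in <GENE></GENE> tags,
    numbering repeated occurrences of the same protein name."""
    seen = Counter()

    def tag(match):
        name = match.group(0)
        seen[name] += 1
        # First mention keeps the plain name; later mentions get an index.
        label = name if seen[name] == 1 else f"{name}_{seen[name]}"
        return f"<GENE>{label}</GENE>"

    # Longest-first alternation so longer names are matched before substrings.
    pattern = re.compile("|".join(re.escape(e) for e in sorted(entities, key=len, reverse=True)))
    return pattern.sub(tag, sentence)

print(tag_all_entities(
    "MDM2 binds the transactivation domain of p53 and downregulates p53 activity.",
    ["MDM2", "p53"]))
# -> <GENE>MDM2</GENE> binds ... of <GENE>p53</GENE> and downregulates <GENE>p53_2</GENE> activity.
```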

Fig. 1
figure 1

Prompting strategies for PPI identification.

Supplementary Figure 1 showcases examples of input-output pairs for prompts 1 and 3, where ‘USER’ is the designated input and ‘ASSISTANT’ the resulting output from the GPT models. In the sentence ‘LLL.d2.s2’, three protein entities are present, which leads to three pairs of input questions generated through permutation and combination. Example (a) in Supplementary Figure 1 illustrates a single prompt pair for prompt 1, while Example (b) demonstrates the application of prompt 3 to the same sentence. Given that prompts 2 and 3 elicit multiple PPI results from the models, the prompt descriptions were extended to request responses in JSON format to simplify subsequent result aggregation and analysis. The underlined descriptions on the input side of the figure show the differences between prompt 3 and prompt 2. By enumerating all protein entity names in the sentence, we aim to assess whether the model’s ability to judge associations improves. The most intricate prompt, prompt 5, uses the same query framework as prompt 3 but adds explicit and comprehensive entity tagging and numbering to the input sentence. Figure 2 illustrates the schematic representation of input sentence tagging.
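To illustrate how such a query might be assembled, the sketch below builds a prompt-3-style request that enumerates all protein entities and asks for the interacting pairs in JSON. Both the prompt wording and the example sentence are hypothetical paraphrases for illustration, not the exact prompt text used in the study or the actual LLL.d2.s2 sentence.

```python
def build_prompt3(sentence, entities):
    """Assemble a prompt that enumerates all entities and requests JSON output."""
    return (
        f"Sentence: {sentence}\n"
        f"Protein entities in this sentence: {', '.join(entities)}\n"
        "Identify all protein-protein interactions described in the sentence and "
        'answer in JSON format, e.g. {"interactions": [["proteinA", "proteinB"]]}.'
    )

print(build_prompt3("ykvP transcription depends on SigK and GerE.",
                    ["ykvP", "SigK", "GerE"]))
```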

Fig. 2
figure 2

Schematic representation of tagging in protein entities.

To assess the efficacy of all prompts, we manually selected ten sentences containing protein entities from each of the five PPI benchmark datasets. The selected sentences were required to generate between 3 and 6 positive/negative instances each to maintain a homogeneous distribution in the sample set. The statistical data of the sample dataset are presented in Supplementary Figure 2. The sample dataset was applied to both the GPT-3.5 and GPT-4 models to select the most reliable prompt.

During the data examination, we established a systematic approach to address complex entity scenarios that extend beyond basic predefined prompts. Our analysis revealed three primary patterns requiring specialized handling: symbol-based compound entities, keyword-indicated complexes, and nested entities. First, we implemented symbol-based filtering to identify compound protein entities. Sentences containing protein names with "/" or "-" symbols were flagged as compound protein entities. For example, "Arp2/3 complex" represents two distinct proteins, “Arp2” and "Arp3," functioning as a unified complex. These marked protein names underwent additional analysis to ensure accurate interpretation by the model. Second, we developed keyword-based identification for complex naming entities. Sentences containing specific keywords such as "complex," "subunit," or “component” were identified as potential compound naming entities. For instance, "The Arp2/3 complex binds actin filaments" exemplifies such cases. These sentences received specialized processing by the LLM to prevent prediction errors arising from abbreviations or naming variations. Third, we addressed nested entity scenarios, where both short-form and descriptive long-form protein names appear within the same context. For example, in "... two transmembrane receptors, the p75 neurotrophin receptor and the p140trk (trkA) tyrosine kinase receptor, ...", both “p75” and its longer form "p75 neurotrophin receptor" appear as nested entities. To avoid manual intervention, we employed a gene abbreviation recognition module to detect and process these nested and compound terms automatically. The prompt instructs GPT to consider complete protein names rather than merely abbreviations during PPI identification.
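The three screening rules described above can be expressed as simple string and pattern checks, as in the following sketch; the function names and routing helper are illustrative assumptions, not the study's actual implementation.

```python
import re

COMPLEX_KEYWORDS = {"complex", "subunit", "component"}

def has_compound_symbol(entities):
    """Rule 1: entity names containing '/' or '-' (e.g. 'Arp2/3 complex')."""
    return any("/" in e or "-" in e for e in entities)

def has_complex_keyword(sentence):
    """Rule 2: sentences mentioning complex-related keywords."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    return bool(words & COMPLEX_KEYWORDS)

def has_nested_entities(entities):
    """Rule 3: one annotated entity contained in another
    (e.g. 'p75' inside 'p75 neurotrophin receptor')."""
    return any(a != b and a in b for a in entities for b in entities)

def needs_complex_prompt(sentence, entities):
    """Route to the complex-entity handling (Prompt 6) if any rule fires."""
    return (has_compound_symbol(entities)
            or has_complex_keyword(sentence)
            or has_nested_entities(entities))
```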

Statistical analysis of the datasets revealed that these complex entity patterns appear exclusively in the AIMed and BioInfer datasets, comprising approximately 23% of instances within each dataset. To address these scenarios, we developed Prompt 6, which builds upon Prompt 1 by incorporating comprehensive guidelines for handling compound terms, nested entities, and their variations. The entity tagging strategy requires listing all protein entity names and examining protein pairs systematically, with sentence processing considered complete only after evaluating all possible protein pair combinations. Figure 3 demonstrates this approach using examples from the BioInfer dataset, with the underlined text highlighting compound entity descriptions and comparing predictions across different LLMs.

Fig. 3
figure 3

Design of Prompt 6 for handling complex protein entities.

A scenario-based prompting framework for PPI detection

Sentences describing PPIs often vary widely due to differences in research topics, author writing styles, and experimental types, which makes them challenging to categorize. To address these challenges, we developed a comprehensive framework, illustrated in Fig. 4, that begins with prompt design and evaluation using sample datasets. The framework starts from the prompting scenarios (lower left in the figure, with detailed design principles shown in Fig. 1). In this phase, input sentences are categorized based on entity complexity: basic entities and complex entities (including complex, compound, or nested structures). For each category, candidate prompts are evaluated using GPT models to determine the most effective prompt for that specific scenario; this evaluation yielded Prompt 5 for basic entities and Prompt 6 for complex entities. Following prompt assessment, all PPI dataset sentences undergo scenario classification based on their entity complexity. The sentences are then processed using appropriate marking strategies to highlight gene locations: basic entities undergo gene tagging and numbering, while complex entities are processed through pairwise gene tagging. Finally, these processed sentences are combined with their corresponding scenario-specific prompts for performance evaluation using proprietary LLMs, including the GPT and Gemini models. This systematic approach ensures appropriate handling of various PPI descriptions while optimizing prompt selection based on entity complexity and sentence structure, and it allows the predictions to be evaluated comprehensively.
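A minimal sketch of this routing is shown below; it reuses the screening and tagging helpers from the earlier sketches and hides prompt construction and the API call behind placeholder functions (build_basic_prompt, build_complex_prompt, call_llm), all of which are illustrative rather than the study's actual implementation.

```python
from itertools import combinations

def build_basic_prompt(tagged_sentence, entities):
    # Prompt-5-style query: all entities tagged and enumerated, JSON output requested.
    return (f"Sentence: {tagged_sentence}\nEntities: {', '.join(entities)}\n"
            'List all interacting protein pairs in JSON: {"interactions": [...]}.')

def build_complex_prompt(tagged_sentence, e1, e2):
    # Prompt-6-style query: one tagged pair plus guidance on compound/nested names.
    return (f"Sentence: {tagged_sentence}\n"
            f"Considering full protein names rather than abbreviations only, "
            f'do {e1} and {e2} interact? Answer in JSON: {{"interaction": "yes or no"}}.')

def predict_sentence(sentence, entities, call_llm):
    """Classify the sentence scenario, tag it accordingly, and query the LLM."""
    if needs_complex_prompt(sentence, entities):      # complex/compound/nested entities
        answers = []
        for e1, e2 in combinations(entities, 2):      # pairwise gene tagging
            tagged = tag_all_entities(sentence, [e1, e2])
            answers.append(((e1, e2), call_llm(build_complex_prompt(tagged, e1, e2))))
        return answers
    tagged = tag_all_entities(sentence, entities)     # gene tagging and numbering
    return call_llm(build_basic_prompt(tagged, entities))
```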

Fig. 4
figure 4

PPI prompting framework.

Results

Determination of candidate prompts

The primary objective of this experiment was to ascertain which of the general prompts (Prompts 1–5) best helps the model discern different PPIs. To achieve this, we employed the PPI sample dataset delineated in Supplementary Figure 2 as the focal point for evaluation and assessed the performance of the general prompts on both the GPT-3.5 and GPT-4 models. The findings, shown in Table 1, consistently demonstrate that Prompt 5 exhibited superior performance on the sample dataset across both model iterations. According to Supplementary Figure 3, the design strategy of Prompt 5 helps GPT accurately identify multiple proteins with the same name during PPI prediction and subsequently list the actual interacting groups based on content associations. Building on the efficacy demonstrated by Prompt 5, we then investigated the impact of inputting multiple sentences simultaneously on the performance of the GPT models (Table 2). The findings demonstrate not only GPT-4’s superior performance over GPT-3.5 but also the more consistent model performance obtained with the single-sentence input methodology.

Table 1 Performance (in %) of general prompts (P/R/F).
Table 2 Prompt 5 performance (in %) of multiple-sentence input (P/R/F).

To evaluate the performance of Prompt 6, which is specifically designed to handle cases involving nested protein named entities, we created an evaluation sample set consisting of 10 sentences each from the AIMed and BioInfer datasets. The AIMed sample contains 15 positive and 27 negative instances, while the BioInfer sample contains 42 positive and 30 negative instances. This sample composition allows for a balanced assessment of the prompt’s effectiveness in identifying relevant entities. Given their similar query statement structures, Prompt 1 served as the baseline for evaluation. The results presented in Table 3 indicate that Prompt 6 outperforms Prompt 1 for both the GPT-3.5 and GPT-4 models. This highlights the beneficial impact of providing compound term information on the performance of large language models.

Table 3 Performance (in %) of complicated prompt design (P/R/F).

Performance evaluation of PPI datasets

Based on the previously established prompt filtering process, we evaluated the performance of GPT-3.5, GPT-4, and Google Gemini 1.5 (both Flash and Pro versions) across six PPI datasets using the final Prompts 5 and 6. The “Raw data” column of Table 4 presents comparative results across all models. Notably, Gemini 1.5 Pro demonstrated superior performance across most datasets, achieving the highest F1-scores on the LLL (90.3%), IEPA (68.2%), HPRD50 (67.5%), and PEDD (70.2%) datasets. GPT-4 showed competitive performance, particularly on the LLL dataset (87.3% F1-score), while generally maintaining stable performance across the other datasets. GPT-3.5 demonstrated consistent but relatively lower performance compared to the other models. The variation in model performance across datasets reflects the inherent challenges in their characteristics. The LLL dataset, despite being the smallest, yielded the highest performance across all models (ranging from 79.2 to 90.3% F1-score), which suggests that its smaller size and potentially more straightforward entity relationships facilitate better model comprehension. In contrast, more complex datasets like AIMed, BioInfer, and PEDD showed lower performance across all models, with F1-scores typically ranging from 37.9 to 70.2% for the raw data. This performance pattern highlights the challenges posed by complex entity features and rich content in these datasets.

Table 4 Performance (in %) on PPI datasets (P/R/F).

Furthermore, we identified a bias in all models towards predicting input instances as positive, which led to a diminished discriminatory capability when processing exclusively negative instances. To address this limitation, we excluded sentences containing solely negative instances from the datasets during performance evaluation. The “Refined data” column of Table 4 shows the models’ performance after this modification. The LLL dataset, lacking such instances, maintained its original distribution and was excluded from the refined evaluation. In this scenario, all models showed improved precision and F1-scores across the refined datasets. For instance, on the IEPA dataset, Gemini 1.5 Pro achieved an F1-score of 89.7%, while GPT-4 and GPT-3.5 reached 86.4% and 84.7%, respectively. This consistent improvement across models and datasets further validates our observation regarding the models’ positive instance bias and demonstrates the effectiveness of our refinement strategy.
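A minimal sketch of this refinement step, assuming instances are grouped by sentence as in the earlier labeling sketch, is shown below; the helper name is illustrative.

```python
from collections import defaultdict

def refine(instances):
    """Drop sentences whose candidate pairs are all negative; keep the rest unchanged."""
    by_sentence = defaultdict(list)
    for inst in instances:
        by_sentence[inst["sentence"]].append(inst)
    return [inst
            for group in by_sentence.values()
            if any(i["label"] == "positive" for i in group)
            for inst in group]
```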

Discussion

The application of LLMs to biological relationship extraction presents both promising opportunities and notable challenges. Our comprehensive evaluation of LLMs in PPI prediction reveals several key insights regarding model performance, limitations, and practical implications for the biomedical research community. In this discussion, we first systematically examine the relationship between model performance and dataset characteristics, followed by a detailed analysis of model-specific results. We then address key limitations and challenges before exploring future directions and potential applications. Our analysis demonstrates varying levels of effectiveness across different models and datasets, with particularly notable performance patterns. The most striking observation is the consistently superior performance achieved on the LLL dataset across all models, with F1-scores ranging from 79.2 to 90.3%, and Gemini 1.5 Pro achieving the highest F1-score of 90.3%. This exceptional performance can be attributed to several factors, including LLL’s smaller size, straightforward annotation scheme, and simpler grammatical structures, which provide a less challenging environment for model comprehension50. These dataset-specific characteristics form the foundation for understanding the broader patterns of model performance across our evaluation suite.

The performance variations across datasets reveal critical insights into model behavior and limitations. Beyond quantitative performance metrics, our linguistic analysis uncovered significant challenges in semantic parsing and referential understanding. We observed that LLMs demonstrate notable susceptibility to misclassification when confronted with complex interaction-related linguistic structures. Two illustrative examples underscore these interpretative challenges. In the first instance, consider the sentence excerpted from BioInfer.d245: "Here we have now identified the reciprocal complementary binding site in alpha-catenin which mediates its interaction with <GENE>beta-catenin</GENE> and <GENE>plakoglobin</GENE>." Lexical cues such as "mediates" and "interaction" precipitated a model-driven misinterpretation, erroneously suggesting a direct PPI relationship between beta-catenin and plakoglobin. Critically, the models overlooked the nuanced reality that alpha-catenin separately interacts with these entities. The referential narrative structure, particularly the pronominal phrase "its interaction," represents a significant interpretative vulnerability for contemporary LLMs. A second example, excerpted from AIMed.d182, further highlights the challenges of referential disambiguation in LLMs. Specifically, the sentence “The MDM2 oncoprotein is a cellular inhibitor of the <GENE>p53</GENE> tumor suppressor in that it can bind the transactivation domain of <GENE>p53</GENE> and downregulate its ability to activate transcription” demonstrates how repeated gene mentions require accurate coreference resolution to preserve semantic clarity. In this case, the two p53 entities plainly lack a protein-protein interaction. The pronoun “it” unambiguously references the sentence’s primary subject, "MDM2 oncoprotein". Nevertheless, the models consistently failed to discern these critical contextual nuances, erroneously predicting an interconnection. The marked contrast between simpler datasets like LLL and more complex ones such as AIMed and BioInfer highlights how sentence complexity and grammatical structure significantly impact extraction accuracy. This is particularly evident in AIMed’s sophisticated sentence constructions and nested entities, which resulted in notably lower precision scores of 26.6% and 23.8% for GPT-3.5 and GPT-4, respectively. The representation of biological entities across different datasets also proves to be a crucial factor, with BioInfer’s complex compound entities and abbreviations significantly challenging model comprehension, as reflected in its lower F1-scores ranging from 45.3 to 55.3% in the raw data. In contrast, datasets featuring more straightforward entity mentions, such as HPRD50, demonstrated more robust performance with F1-scores between 63.6 and 67.5%. Additionally, our analysis uncovered a consistent bias across all models toward predicting positive instances, leading us to refine our evaluation approach by excluding sentences containing solely negative instances. This refinement resulted in substantial performance improvements, as evidenced by IEPA’s F1-scores increasing from 65.6 to 84.7% for GPT-3.5 and from 64.0 to 86.4% for GPT-4; this underscores the significant impact of data distribution on model behavior.

Gemini 1.5 Pro demonstrated superior performance across most datasets, which can be attributed to its advanced model architecture and optimization. This superior performance was particularly evident for the LLL (90.3% F1-score), IEPA (89.7% F1-score in refined data), and HPRD50 (87.1% F1-score in refined data) datasets. GPT-4 showed competitive performance, particularly in the LLL dataset (87.3% F1-score), while maintaining stable performance across other datasets. GPT-3.5, while showing consistent performance, generally achieved lower scores compared to other models. To contextualize these results within the current research landscape, a recent study by Rehana et al.51 evaluated GPT and BERT models on PPI tasks, achieving F1-scores of 86.49% (GPT-4 on LLL), 71.54% (GPT-4 on IEPA), and 65.0% (GPT-4 on HPRD50) using strategies such as protein dictionary normalization and protein masking. While their approach relies on sophisticated preprocessing techniques, our prompt design approach achieves comparable or superior results while focusing solely on sentence structure, and in so doing, offers a more user-friendly approach for non-technical biomedical researchers. Further analysis of the LLMs’ behavior in specific linguistic contexts revealed additional performance patterns. We examined PPI sentences from the PEDD dataset containing inference-related keywords including whether, may, might, could, potential, and other similar terms; we identified 2,570 cases with 282 positive and 2,288 negative instances for evaluation. In these inferential contexts, Gemini 1.5 Pro maintained its superior performance with an F1-score of 0.2, followed by Gemini 1.5 Flash at 0.179 and GPT-3.5 at 0.163, while GPT-4 unexpectedly showed the lowest performance at 0.154. The significantly lower F1-scores in inferential contexts compared to general PPI detection suggest that LLMs struggle to interpret protein interactions when presented with hypothetical or uncertain relationships. Notably, all models exhibited high false positive rates in inferential contexts, with even the best-performing Gemini 1.5 Pro producing 1,550 false positive cases. This observation provides additional evidence of a broader pattern in LLMs’ prediction behavior that warrants further investigation, as we discuss in our analysis of the study’s limitations.

While these results demonstrate promising capabilities of LLMs in biomedical relationship extraction, several important limitations and challenges emerged from our analysis. The observed bias of LLMs towards positive PPI predictions warrants careful consideration. A comprehensive review discusses how biases can manifest in various natural language processing tasks, including literature mining, highlighting the tendency of LLMs to generate outputs that may favor certain narratives or perspectives52. The ability to accurately identify both the presence and absence of protein interactions is crucial for understanding biological systems and developing therapeutic interventions. Our findings suggest that current LLMs, despite their sophisticated architecture, may exhibit a systematic bias that could lead to false positive predictions in PPI detection tasks. This limitation could be particularly problematic in exploratory research where identifying non-interacting protein pairs is as valuable as detecting interactions. Our analysis also revealed interesting variations in how different LLMs identify potential PPIs within established complexes. For example, when analyzing the sentence "Arp2/3 complex from Acanthamoeba binds profilin and cross-links actin filaments" and subsequently asked whether Arp2 and Arp3 components interact with each other, we observed divergent interpretations: GPT-3.5 identified a positive PPI between Arp2 and Arp3 subunits within the complex, while GPT-4, Gemini 1.5 Flash, and Gemini 1.5 Pro classified it as a negative interaction. This discrepancy highlights a fundamental challenge in how LLMs process relationships between proteins that function as part of established complexes. The Arp2/3 example demonstrates that even when addressing interactions between components of well-characterized molecular assemblies, LLMs can produce inconsistent results. This variation represents an important consideration when employing LLMs for biomedical knowledge extraction, particularly for questions about the internal structure of protein complexes that may not be explicitly stated in the text.

Traditional transformer-based models (BioBERT, PubMedBERT, SciBERT) have demonstrated superior performance across these datasets, with BioBERT achieving F1-scores ranging from 74.95% (HPRD50) to 86.84% (LLL)51. Similarly, previous machine learning approaches have shown competitive results, with sdpCNN achieving 66% and 75.2% on AIMed and BioInfer respectively53, and the tree-based DSTK model achieving scores ranging from 71.01% (AIMed) to 89.20% (LLL)23. Based on these findings and observed limitations, we recommend that practitioners implementing LLMs for PPI detection should: (1) implement additional validation steps for negative predictions, (2) consider incorporating confidence scores or uncertainty measures in their predictions, and (3) potentially employ ensemble approaches that combine LLM predictions with traditional machine learning methods that have shown robust performance in identifying negative instances. Looking toward future developments, while the system testing results obtained through prompt design may not surpass various state-of-the-art deep learning models, LLMs offer unique advantages. Their ability to adapt continuously to specific task demands through prompt adjustments without retraining54, together with their potential to outperform state-of-the-art models when fine-tuned on specific datasets55, makes them particularly valuable for non-computer science experts56. Recent research has shown promising strategies for improving LLM performance in biomedical applications. Few-shot learning approaches have demonstrated results comparable to SOTA models in tasks such as NER, relation extraction, summarization, and QA55. In the nephrology domain, chain-of-thought (CoT) strategies have successfully guided models in clinical decision support57. Future development of LLMs for biomedical applications should explicitly address the positive prediction bias, perhaps through specialized pre-training strategies or architectural modifications that better balance the model’s ability to identify both positive and negative protein interactions. LLMs have already begun to appear in highly specialized clinical research, such as extracting diagnosis information from cancer examination reports58,59. Their performance on the recent PEDD dataset (achieving a 70.2% F1-score compared to BioBERT’s 77.06%)49 demonstrates their ability to handle contemporary biomedical literature effectively, though precise task completion still relies heavily on appropriate prompt guidance60. In an era emphasizing multi-objective applications, the integration of large language models can provide a robust foundation for development and research across various domains. Our findings suggest that when combined with effective prompt engineering strategies and domain-specific considerations, LLMs have the potential to revolutionize biomedical relationship extraction, particularly for researchers without extensive computational expertise.

Conclusion

Verifying interactions through biomedical experiments requires precise experimental conditions, and the resulting findings are documented in complex biomedical literature and PPI datasets. This study shows that general-purpose LLMs like the GPT and Gemini models can reliably predict protein interactions and serve as valuable tools for non-experts. Using progressive prompt designs tailored to the specificities of PPI datasets, we addressed model biases, such as misclassifying negative instances as positive, by refining data preprocessing and prompt design to significantly enhance performance. Although the GPT and Gemini models currently lag behind traditional state-of-the-art methods, the rapid advancement of LLM technology offers a promising future. Innovative techniques, such as chain-of-thought prompts and ensemble predictions with multiple LLMs, are expected to further improve performance. The continued evolution of LLMs holds significant potential for advancing research across an ever-diversifying range of fields.