Introduction

Large language models (LLMs) are rapidly being integrated into healthcare1,2,3,4,5,6, with the potential to significantly transform clinical workflows, enhance decision support, and streamline administrative processes7. Recent studies highlight the promise of these advanced AI tools to improve efficiency8, reduce clinician burden9, and potentially enhance patient outcomes through more effective information management and communication support9,10,11,12. However, existing research has focused primarily on physicians and clinical personnel13, even though physicians represent only one in 25 employees in U.S. healthcare14. There remains limited empirical evidence regarding the specific LLM applications most commonly utilized by non-clinician healthcare staff, such as administrative assistants, case managers, and interpreters15.

Without clear insights into how non-clinicians interact with these tools, healthcare organizations face difficulties in optimizing implementations16, ensuring patient safety, and proactively managing emerging risks associated with AI-driven technologies15,17. To address this critical knowledge gap, our study provides a quantitative analysis of non-clinician usage logs from a secure LLM deployment within an academic medical center. This research aims to equip health system leaders, policymakers, and technology developers with actionable insights to maximize the benefits and mitigate the risks of integrating LLMs into health systems.

Results

Usage Trends Across Categories

A total of 30,503 chat threads were analyzed, with 26,691 threads originating from frequent users. Ten primary categories of LLM usage were identified: email and document writing, text manipulation, brainstorming, general information, medical questions, technical support, patient communication, coding, language translation, and image generation. These categories, along with their definitions and representative examples, are detailed in Table 1.

Table 1 List of categories with associated definitions and examples

Analysis of usage patterns demonstrated that ‘Email & Document Writing’ tasks dominated, representing 53.9% of total activity. Other prominent categories included ‘Text Manipulation’ (9.1%), ‘Brainstorming’ (6.7%), ‘Handle Information Requests’ (6.1%), and ‘Answer Medical Knowledge Questions’ (5.9%). ‘Technical Support & IT Issues’ (4.4%), ‘Patient-Provider Messaging’ (3.5%), ‘Coding’ (2.6%), ‘Process Referrals/Documents’ (2.0%), and ‘Generate Visual Aids’ (0.8%) comprised the remainder of usage (Fig. 1).

Fig. 1: Chat tool usage by task category, mapped to the MedHELM task taxonomy.

Horizontal bars indicate the percentage of conversation threads assigned to each task category (x-axis, distribution of prompts [%]). Boxes on the left depict the corresponding MedHELM category and subcategory for each task, with connecting lines indicating the mapping.

Summary of User Role Categories in the Study Sample

Of the users who signed in to their department and had a role available to our system (comprising 97.2% of total users), 98.0% were non-clinicians. In this study, the term ‘department’ referred to distinct clinical entities, typically different ambulatory clinic locations or clinical services, that require unique builds within the electronic health record. Please see the Supplementary Information (Supplementary Fig. 2) for details regarding the distribution of user role categories, which were manually mapped to high-level U.S. Standard Occupational Classification (SOC) Major Groups18. The complete taxonomy mapping of usage categories to the MedHELM framework is detailed in Supplementary Table 2 (see Supplementary Information C for advanced practice provider and allied health professional definitions).

Illustrative Examples from Quantitative Usage Categories

In addition to quantifying aggregate usage patterns, examples of user queries were reviewed that illustrate the potential value and risks of secure LLM deployment. One recurring pattern was requests to automate frequently repeated administrative tasks, such as:

“user: please help me write a brief summary of what a spider angioma is for patients in my pediatric dermatology clinic”

“user: ask me questions to better generate a script for our schedulers to inform families of our wait times before scheduling.”

However, our analysis also surfaced conversations largely unrelated to work or organizational goals. These included:

“user: guess that baby shower game tagline”

“user: can i know the price of the stock tesla 2025?”

“user: what is a good joke for today?”

Although the majority of user roles were non-clinical, a notable proportion of prompts related to clinical decision making. Examples included:

“user: What are the symptoms for RSV?”

“user: Patient taking high dose oxcarbazepine, valproic acid medium dose valproic acid and cenobamate. Presents with constipation and abdominal pain as well as decreased appetite. What is the differential diagnosis?”

“user: Write a letter of medical necessity for patient to get Monogenic Hypertension Evaluation testing through Athena Diagnostics.”

“user: write letter for appeal to CVS Caremark to approve Retacrit for patient.”

Another use case involved generating insurance prior authorization requests detailing the medical necessity of specific treatments.

Discussion

This study contributes an early quantitative analysis of real-world chat tool use among non‑clinician healthcare staff, addressing a gap in the existing literature that has focused almost exclusively on clinicians. Our analysis of 30,503 chat threads moves beyond prior survey-based or pre-categorized approaches5,6, revealing how non-clinicians adopt and integrate these LLM tools.

Usage was dominated by routine tasks such as email composition and document preparation, suggesting considerable untapped potential for more sophisticated applications. Expanding staff awareness of the range of workflow-relevant functions, beyond traditional search or basic writing tasks, alongside training in prompt engineering16 and the development of customized templates for high‑frequency administrative tasks, such as prior authorization letters19, could improve efficiency and reduce administrative burden. However, the presence of prompts requiring nuanced clinical judgment underscores the need for role‑appropriate education and governance20 to prevent out‑of‑scope use.

Unexpected usage patterns included a notable proportion of non-work-related personal queries (e.g., trivia questions, creative writing), which at scale may generate unnecessary computational and environmental costs21. In addition, prompts requiring nuanced clinical judgment were observed. These likely reflect the small proportion of clinicians engaged with the system despite role-targeted alternatives, or non-clinicians involved in insurance-related workflows. Nonetheless, governance and risk-management considerations necessitate further defining the scope of non-clinician use when seeking LLM-derived medical guidance. Several non-clinical uses were identified, including Email and Document Writing, Coding, Technical Support and IT Issues, Text Manipulation, and Brainstorming, that were absent from the MedHELM task taxonomy. This gap illustrates the limits of existing frameworks’ ability to capture real‑world LLM use among non‑clinician staff. We propose adding these tasks to MedHELM version 2 under the ‘Administration & Workflow’ category.

These findings highlight the importance of deployment strategies and monitoring tailored to departmental needs. Training priorities will differ, for example, between language translation teams focused on the LLM’s language strengths and information technology (IT) groups leveraging coding assistance features or internal IT workflow troubleshooting capabilities. Role‑based analyses are essential for determining appropriate user education that includes insights into job-specific workflow enhancements, such as prompting techniques for document insight generation or best practices regarding programming assistance.

Limitations include the single‑center design, the 11‑month study period, and potential classification challenges due to multi‑intent conversations. Inconsistent role designations also limited role‑specific analyses. Future research should link usage patterns to departmental context, refine occupational coding, and explore automating high‑volume administrative tasks, such as prior authorization or patient education document creation.

Based on these findings, we offer three targeted policy recommendations for future LLM deployments in healthcare, each linked to specific patterns that were observed:

  1.

    Implement real-time dashboards for usage monitoring. Both directly valuable workflow applications (e.g., automating documentation) and unrelated or inappropriate use (e.g., personal queries, off‑task interactions) were observed. Continuous visibility into usage patterns would allow health systems to quickly identify emerging high‑impact applications worth expanding and to intervene early when unrelated, risky, or non‑compliant prompts are used. We recommend institutions deploy analytics tools within the secure LLM environment to display usage by category, department, and frequency, with automated alerts for unusual trends (e.g., spikes in clinical decision support queries from non‑clinical departments).

  2.

    Develop role‑ and department‑specific guidance and educational resources. Despite 98% of users being non‑clinicians, a notable subset of prompts involved nuanced clinical decision‑making or insurer‑facing medical necessity statements. Tailoring guidance to user roles would help prevent out‑of‑scope activity, reduce risk, and control compute costs. We recommend organizations map common task categories by department and deliver targeted education that clarifies appropriate use of secure LLMs.

  3.

    Automate validated use cases. Many users repeated specific tasks (such as patient education or insurance-related communications). Automating high-value use cases can enable further efficiency gains. We recommend institutions validate specific workflows by creating automations and templates that encourage use and simultaneously educate users on a broader range of chat tool capabilities.
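As an illustration of the dashboard recommendation above, the sketch below aggregates thread counts by category per week and flags weeks whose volume deviates sharply from a category's historical mean. The column names (`timestamp`, `category`) and the z-score threshold are hypothetical; this is not the analytics tooling used in this study.

```python
import pandas as pd

def weekly_category_counts(logs: pd.DataFrame) -> pd.DataFrame:
    """Count conversation threads per category per week.

    Assumes one row per thread with 'timestamp' and 'category' columns.
    """
    logs = logs.assign(week=pd.to_datetime(logs["timestamp"]).dt.to_period("W"))
    return logs.groupby(["week", "category"]).size().unstack(fill_value=0)

def flag_spikes(counts: pd.DataFrame, z: float = 3.0) -> pd.DataFrame:
    """Flag weeks where a category's volume exceeds its mean by z standard deviations."""
    mu, sigma = counts.mean(), counts.std(ddof=0)
    return counts.gt(mu + z * sigma)  # boolean frame, same shape as counts
```

A flagged cell (e.g., a spike in medical-knowledge queries from an administrative department) would then trigger the kind of automated alert described above.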

By systematically monitoring real-world usage patterns and proactively managing risks, health systems can maximize the benefits of LLM integration while safeguarding patient safety and operational efficiency.

Methods

Study Design and Setting

This retrospective, cross-sectional study was conducted at Stanford Medicine Children’s Hospital, leveraging de-identified usage logs of user prompts from a secure, HIPAA-compliant LLM chat tool (GPT-4o, OpenAI). The study period spanned from April 22, 2024 to February 28, 2025 and encompassed a total of 30,503 conversation threads across 239 clinical and administrative roles. For the purposes of this study, the clinician role was defined as one that provides direct patient care in the form of diagnosis, treatment and prescribing (including physicians, APPs, certified nurse midwife, certified nurse practitioner, clinical nurse specialist, and certified registered nurse anesthetist)22. Roles of non-clinicians spanned functions in management, training, nursing, education, administrative support, IT services, and more. For a more detailed breakdown of non-clinician user roles by category, see Supplementary Fig. 2 in Supplementary Information. Of note, most physicians are employed by the affiliated university and are granted access to and encouraged to use a different secure LLM chatbot.

This study was reviewed and deemed exempt by the Stanford University Institutional Review Board (Protocol ID: 81541).

Sample Selection

To focus on users with greater familiarity and engagement with the platform, our primary analysis was restricted to threads generated by “frequent users,” defined as individuals with more than five recorded interactions with the chat tool. This threshold was chosen to ensure an adequate sample size: early in deployment, few users had surpassed five interactions (20.7% in September 2024), a proportion that increased to 43.7% by the end of the study period. Study size was determined by the number of chat threads available at the time of data analysis.
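The frequent-user restriction amounts to a simple filter, sketched below with a hypothetical `user_id` column (illustrative only; not the study's actual pipeline):

```python
import pandas as pd

def frequent_user_threads(threads: pd.DataFrame, min_interactions: int = 6) -> pd.DataFrame:
    """Keep only threads from 'frequent users': individuals with more than
    five (i.e., at least six) recorded interactions with the chat tool."""
    counts = threads["user_id"].value_counts()
    frequent = counts[counts >= min_interactions].index
    return threads[threads["user_id"].isin(frequent)]
```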

Categorization Scheme Development

The categorization scheme for user queries was developed through an iterative process, initially adapting categories from Bedi et al.15 to reflect healthcare-specific workflows. The preliminary list comprised 11 categories, including an “other” designation for uncaptured use cases. Each human reviewer independently applied these labels to a random sample of 100 messages, followed by consensus labeling sessions to resolve discrepancies and refine categories and their associated definitions.

To generate example messages for each category, a secondary LLM was prompted to produce five candidate messages per category. These were then classified by this LLM to ensure alignment, and independently reviewed by two project team members (WH and KB), who selected the three most representative examples for each category. When a conversation was labeled as “other,” the secondary LLM suggested a descriptive category name, which informed subsequent category list iterations. A curated set of examples was subsequently used to guide both manual and automated classification of conversation threads. The final category list, including definitions and representative examples, is detailed in Table 1 and the Supplementary Information.

Classification Process

Classification of user queries employed a hybrid human-AI approach. Three independent physician reviewers (WH, KB, SM) manually labeled random samples of conversation threads. Discrepancies were resolved by consensus or, when necessary, adjudication by a third reviewer. Automated classification was performed using the GPT-4o API. Model prompts and category definitions were iteratively refined based on error analysis and reviewer feedback (please see Supplementary Information A for detailed prompt methodology and categorization examples in Supplementary Table 1). After this process was completed, author KB manually mapped category results to the MedHELM task taxonomy23.
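The automated step can be sketched as follows. The prompt wording and the `client` interface here are illustrative stand-ins (the study's actual prompt methodology is given in Supplementary Information A); `client` is assumed to be an OpenAI-style chat-completions client, and the category list mirrors Table 1.

```python
CATEGORIES = [
    "Email & Document Writing", "Text Manipulation", "Brainstorming",
    "General Information", "Medical Questions", "Technical Support",
    "Patient Communication", "Coding", "Language Translation", "Image Generation",
]

def build_classification_prompt(thread_text: str) -> str:
    """Assemble a single-label classification prompt from the category list."""
    options = "\n".join(f"- {c}" for c in CATEGORIES)
    return (
        "Assign exactly one of the following categories to the conversation thread.\n"
        f"Categories:\n{options}\n\n"
        f"Thread:\n{thread_text}\n\n"
        "Respond with the category name only."
    )

def classify_thread(client, thread_text: str, model: str = "gpt-4o") -> str:
    """Return the model's predicted label for one conversation thread.

    `client` is assumed to expose the OpenAI-style
    client.chat.completions.create(...) interface.
    """
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": build_classification_prompt(thread_text)}],
    )
    return response.choices[0].message.content.strip()
```

Pinning temperature to 0 and constraining the response to a category name keeps the output machine-parseable for downstream aggregation.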

Validation and Reliability Assessment

Detailed validation methodologies and performance metrics are provided in Supplementary Information B, with categorization performance visualized in Supplementary Fig. 1.

Data Handling and Preprocessing

All conversations were preserved in their entirety as single input sequences to the LLM and included only the user prompts. Threads exceeding 32,000 characters, the maximum row limit for Excel, were truncated, which affected a small proportion (1.1%) of the dataset.
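A minimal sketch of this preprocessing step (illustrative; the study describes only the character limit, not the export code):

```python
MAX_CHARS = 32_000  # per-thread character limit noted above

def prepare_thread(user_prompts: list[str], max_chars: int = MAX_CHARS) -> str:
    """Join a thread's user prompts into one input sequence, truncating
    anything beyond the character limit."""
    text = "\n".join(user_prompts)
    return text[:max_chars]
```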