Introduction

Clinical Artificial Intelligence (AI) technologies built on electronic health record (EHR) data have demonstrated considerable promise in supporting tasks such as disease diagnosis and phenotyping, outcome prediction, and clinical decision support1. Numerous clinician-facing models that leverage large-scale clinical datasets have shown efficacy in improving healthcare delivery, operational efficiency, and evidence-based clinical decision-making2. Most models, however, focus exclusively on EHR-derived patient data (hereafter referred to as patient data), assuming that AI-powered clinical tasks can be sufficiently supported by patient data alone. This assumption overlooks the pivotal role of clinician behaviors, which encompass the judgment, attitudes, and actions of clinicians throughout their decision-making processes in patient care. These behaviors not only reflect the collective insights of care teams into patients’ conditions and influence care outcomes, but also provide a critical lens for evaluating clinical AI technologies, thereby helping to ensure these technologies effectively improve patient care without introducing unintended consequences, such as workflow disruption3 and clinician deskilling4,5. Therefore, taking a holistic perspective that combines both patient data and clinician behaviors can offer a more comprehensive view of clinical care and open new opportunities for building and assessing clinical AI applications.

EHR use metadata refer to a collection of various types of event logs—including audit logs and other log types that document information such as clinical decision support alerts and secure messages—which capture user interactions with the EHR, as well as the creation and use of clinical data6. They provide a granular, longitudinal, and objective record of how individuals (clinicians, administrators, and patients) engage with patient records and utilize various EHR functionalities for patient care activities7,8. They systematically document actions performed within the EHR and system events triggered by these actions, specifying who, in what role, performed each action, what was done, on which patient’s record, when, and where. Recognizing that clinicians devote a substantial portion of their time to engaging with the EHR, researchers have increasingly leveraged EHR use metadata over the past decade to characterize clinician EHR usage patterns and assess health system efficiency9,10. While EHR use metadata may not serve as a precise proxy for clinician behaviors, key dimensions of these behaviors can be inferred, explicitly or implicitly, using well-designed analytical strategies and careful validation. Prior studies have introduced a broad range of metrics to evaluate various dimensions of clinician behaviors in the EHR, such as clinician daily workload11,12, cognitive burden13, and patterns of collaboration among care team members14,15. These metrics have been instrumental in revealing links between clinician behaviors and key patient outcomes9,10,16,17,18 (such as length of stay, mortality, and readmission risk) as well as clinician well-being19,20 (such as burnout, job dissatisfaction, and intentions to leave the profession).

Patient data and EHR use metadata represent two interdependent yet fundamentally distinct information sources. Patient data (e.g., diagnoses, procedures, measurement results, and medication orders) characterize a patient’s health journey and the outcome of care. In contrast, EHR use metadata capture the process by which those data are generated and used for care, encoding clinicians’ reasoning, prioritization, and workflow decisions, which might not directly determine clinical outcomes. Notably, these behavior traces can serve as proxies for unrecorded clinical observations and decisions. They also illuminate the context of how and why certain clinical observations are captured in the first place. For example, while a CT scan result in patient data may show no acute findings, EHR use metadata capturing how urgently the scan was ordered, how soon it was reviewed, and whether it led to additional workups (e.g., ordering additional laboratory tests or specialist referrals) can reveal escalating clinical concern. This action sequence reflects the clinician’s evolving level of suspicion regarding potential differential diagnoses, thus providing a more informative context than the scan result alone. Moreover, EHR use metadata can surface aspects of patient status that are absent from patient data. Consider a scenario in which a patient’s oxygen saturation level remains within normal limits, suggesting stability; however, EHR use metadata indicating frequent reviews of respiratory data, rapid chart navigation, and multiple clinicians documenting notes in quick succession may reflect heightened clinical vigilance and can precede abrupt clinical deterioration, offering early insights unavailable in patient data. Utilizing this complementary relationship is important because it conveys not only what happened to a patient but also the underlying clinical rationale, the care team’s interpretation of the patient’s condition, and the sequence of decision-making activities over time.

Recent attempts in clinical AI highlight the transformative potential of integrating information derived from EHR use metadata throughout the AI lifecycle, ranging from model development21,22,23,24,25,26,27 to post-deployment evaluation28,29. This integration enables the creation of more adaptive, context-aware, and robust AI systems that can better withstand data shifts and facilitate continuous performance monitoring and improvement in real-world clinical settings. Here, we propose a paradigm shift in clinical AI toward a more integrated approach—one that harnesses both patient data and EHR use metadata to embed the collective insights and behaviors of care teams into model development and evaluation (Fig. 1a). By combining these complementary data sources that contribute unique aspects of clinical context, we can achieve a more holistic understanding of clinical care to inform the development and evaluation of clinical AI. This, in turn, paves the way for more effective and trustworthy patient care that is centered on the needs of both patients and clinicians.

Fig. 1: Dual-Lens clinical AI framework combining patient data with clinician behaviors.

a An overview of the Dual-lens clinical AI lifecycle. b An illustration of patient data generation within the EHR, with clinicians’ role highlighted in blue arrows. c A comparison of the relationships captured by traditional clinical AI (upper) and Dual-Lens clinical AI (lower), with novel signals highlighted in purple.

This Perspective explores the current research landscape surrounding EHR use metadata, highlights the potential and opportunities of the new paradigm in clinical AI, and outlines key challenges and considerations for the future. We will use the terms EHR use metadata, metadata, and event logs interchangeably throughout the paper to refer to the data that characterize clinicians’ actions in the EHR.

Current use of EHR metadata in research

Beyond supporting privacy auditing mandated by HIPAA, the current use of EHR use metadata is predominantly shaped by its clinician-oriented (or EHR user-oriented more broadly) nature. Most existing research focuses on converting event logs (even as granular as mouse and keyboard clicks) into higher-level, clinically meaningful information to characterize clinician activities and understand care delivery that relies heavily on the EHR30,31,32,33,34,35. Specifically, researchers have explored (1) deriving quantitative metrics, (2) mapping event logs to task-level workflows, and (3) modeling team structures and dynamics. These efforts seek to reveal patterns across various contexts (such as specialties12,30,31, clinician demographics34,36,37, national health systems38, the COVID-19 pandemic36, and reimbursement policy changes39) to guide improvements in efficiency, coordination, and overall quality of care delivery.

Measuring EHR usage via diverse metrics

Both EHR vendors and researchers have increasingly designed and applied EHR use metadata-based metrics for different purposes9,40. Various general metrics have been introduced12,41,42,43,44,45,46,47, such as login frequency, session duration, frequency of actions, time spent on actions (e.g., total EHR time, note documentation time, activity outside normal working hours, time after patient check-out, message volume, and their normalized values). More sophisticated metrics often involve the navigational patterns across the EHR functionalities. For example, analyzing event sequences that reflect the context-specific intensity and variability of clinician EHR engagement enables the development of key metrics such as time spent managing the In-Basket48,49,50, workload of follow-up actions triggered by alerts51,52, and the frequencies of note template usage53,54. These metrics help to dissect the temporal distribution of various EHR activities and reveal patterns in a specific context, thereby providing meaningful insights into EHR interaction efficiency, system usability, and associated burnout issues.
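As an illustration, several of the general metrics above—active EHR time estimated from inter-event gaps, action frequencies, and after-hours activity—can be derived from a raw event log in a few lines of code. The session-gap threshold and working-hours window below are illustrative assumptions for this sketch, not vendor defaults:

```python
from collections import Counter
from datetime import datetime, timedelta

def ehr_use_metrics(events, session_gap_min=5, workday=(7, 18)):
    """Derive illustrative EHR-use metrics from a time-sorted event log of
    (timestamp, action_type) tuples. Active time sums only inter-event gaps
    shorter than `session_gap_min` minutes, a common audit-log heuristic;
    the thresholds here are assumptions, not standards."""
    gap = timedelta(minutes=session_gap_min)
    active = timedelta(0)
    # Sum short gaps between consecutive events as "active" EHR time.
    for (t_prev, _), (t_cur, _) in zip(events, events[1:]):
        if t_cur - t_prev <= gap:
            active += t_cur - t_prev
    return {
        "active_minutes": active.total_seconds() / 60,
        "action_counts": dict(Counter(a for _, a in events)),
        # Count events falling outside the assumed working-hours window.
        "after_hours_actions": sum(
            1 for t, _ in events if not (workday[0] <= t.hour < workday[1])),
    }
```

In practice, such metrics are typically normalized (e.g., per scheduled hour or per patient encounter) before comparison across clinicians or sites.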

Mapping event logs to clinical tasks, workflows, and cognitive load

As individual events or actions recorded in EHR use metadata are often highly granular and fragmented, they lack meaningful clinical context when considered in isolation. Therefore, it is essential to aggregate event sequences into clinical tasks performed within the EHR10,55,56,57,58,59,60,61. Once these tasks are delineated, they can be systematically mapped to broader workflows or pathways. This mapping serves not only to understand how clinicians navigate the system in parallel with providing real-world patient care but also to help characterize EHR-based cognitive burden. For example, Lou et al. develop metrics of attention switching from event logs as a proxy for cognitive burden17. The attention-switching metric shows face and discriminant validity, as it is associated with increased total time in the EHR and wrong-patient errors. Similarly, large language model-based approaches have been used to develop an action-as-language framework to characterize cognitive burden26. This framework has been used to identify cognitively burdensome tasks at scale, such as switching to and from the inbox. More broadly, identifying frequent task interruptions, deviations from standardized best practice, or prolonged searches for patient information and system functionalities can indicate increased cognitive load of clinicians, with implications for both clinician burnout and patient safety. This line of research, though facing challenges due to the complexity of raw event logs and the diversity of clinical workflows, enables the identification of friction points within the system, such as suboptimal interface design, excessive data fragmentation, and inefficient task flows. By addressing these issues, there is an opportunity to streamline workflows, alleviate cognitive burdens, and ultimately enhance overall clinical efficiency and care quality.
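The aggregation step described above can be sketched minimally as follows, assuming a toy event-to-task category map; real mappings must be validated against observed clinical workflows:

```python
from datetime import datetime, timedelta

# Toy event-to-task map; a production mapping requires clinical validation.
TASK_OF = {"open_note": "documentation", "type_text": "documentation",
           "sign_note": "documentation", "view_labs": "chart_review",
           "view_meds": "chart_review"}

def events_to_tasks(events, gap_minutes=2):
    """Aggregate atomic (timestamp, action) events into task episodes.
    A new episode starts when the task category changes or the inter-event
    gap exceeds `gap_minutes`; both rules are illustrative heuristics."""
    gap = timedelta(minutes=gap_minutes)
    tasks = []
    for time, action in sorted(events):
        cat = TASK_OF.get(action, "other")
        # Extend the current episode only if category and timing both match.
        if tasks and tasks[-1]["task"] == cat and time - tasks[-1]["end"] <= gap:
            tasks[-1]["end"] = time
            tasks[-1]["n_events"] += 1
        else:
            tasks.append({"task": cat, "start": time, "end": time, "n_events": 1})
    return tasks
```

Counting transitions between episodes in the resulting sequence is one simple way to operationalize attention switching.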

Characterizing team structures and dynamics

Growing evidence suggests that EHR use metadata offers valuable insights into how clinicians are structured as multidisciplinary teams and how they collaborate with each other in care delivery9,10,40. By analyzing patterns such as concurrent logins and co-access behaviors, the structure of care teams and the roles of team members can be represented as a network16,55,62,63,64,65,66, i.e., patient-sharing network (PSN). The topological features of a PSN, including measures such as centrality and betweenness, have been leveraged to characterize the exchange intensity of patient information and medical expertise across team members. Several studies have revealed that specific network configurations, such as densely interconnected care teams and those with extensive collaborative experience, tend to be associated with better patient outcomes, whereas other network configurations, such as fragmented or sparse networks, are often linked to less favorable results16,18,67,68. Importantly, this network-based approach also enables the tracking of how collaboration patterns evolve over time, including significant shifts observed in response to major disruptions such as the COVID-19 pandemic63,66. While PSNs may not capture every aspect of real-world collaboration, this approach has been accepted as a useful proxy for revealing the underlying patterns of information exchange and professional interaction, shedding light on how staffing structures should be improved for better effectiveness.
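A patient-sharing network of the kind described above can be sketched from co-access records alone. The example below (with hypothetical identifiers) builds weighted edges between clinicians who accessed the same patient’s chart and computes normalized degree centrality; real PSN studies typically use richer edge definitions and additional topological measures:

```python
from collections import defaultdict
from itertools import combinations

def build_patient_sharing_network(access_log):
    """Build a PSN from (clinician_id, patient_id) co-access pairs.
    Edge weight = number of shared patients between two clinicians."""
    teams = defaultdict(set)
    for clinician, patient in access_log:
        teams[patient].add(clinician)
    weights = defaultdict(int)
    for team in teams.values():
        # Every pair of clinicians sharing this patient gains one edge count.
        for a, b in combinations(sorted(team), 2):
            weights[(a, b)] += 1
    return dict(weights)

def degree_centrality(weights):
    """Normalized degree centrality: distinct collaborators / (n - 1)."""
    nodes = {n for edge in weights for n in edge}
    deg = {n: 0 for n in nodes}
    for a, b in weights:
        deg[a] += 1
        deg[b] += 1
    denom = max(len(nodes) - 1, 1)
    return {n: d / denom for n, d in deg.items()}
```

Betweenness and other measures follow the same pattern once the weighted edge list is available (e.g., via a graph library).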

An integrated clinical AI paradigm

EHR use metadata provide benefits that extend far beyond the basic assessment of EHR utilization. Notably, patient data are not merely passive reflections of disease development; rather, they are actively shaped by the sequential reasoning and decision-making of clinicians as they interpret evolving clinical evidence and respond to a patient’s health status (Fig. 1b). Each clinical data point (such as a laboratory test result, a disease diagnosis, and an initiation of an intervention) reflects not only a patient’s physiological state but also the cognitive judgments and decision pathways of the care team. This dynamic interdependency between patient physiology and clinician decision-making implies that most modeling objectives of clinical AI are inherently determined by both patient-specific factors and clinician-driven choices. Therefore, it is imperative for clinical AI models to embrace an integrated approach that combines both sources of information to ensure a more contextually grounded representation of clinical practice.

The conceptual framework depicted in Fig. 1a, which we named “Dual-Lens clinical AI” (DL-ClinAI), illustrates the new paradigm of clinical AI lifecycle advocated in this Perspective. DL-ClinAI comprises three key components: (1) aligning patient data and EHR use metadata along a common temporal axis, (2) selecting appropriate AI models to learn the complex relationships either between dual-lens features and the target objective (i.e., training task-specific clinical AI models) or within dual-lens features themselves (i.e., developing clinical AI foundation models), and (3) leveraging the dual-lens features to assess the impact of clinical AI tools throughout their development and deployment.

Data alignment

The objective of this component is to achieve coherent integration of patient data and EHR use metadata along a shared temporal framework. This goes beyond simple chronological matching and requires careful consideration in addressing the complexities involved in aligning distinct data streams. First, timeframe unification is necessary to map patient-specific clinical data points (e.g., a laboratory test result or an administered medication) and clinician actions captured by EHR use metadata (e.g., a nurse viewing the patient’s flowsheet or a specialist appending clinical notes) onto a single temporal framework. Since all these events are timestamped in the EHR, initial temporal alignment is relatively straightforward. However, this alignment can be further refined by anchoring events to critical clinical milestones (e.g., symptom onset, intervention initiation, hospital admission, or unit transfer), thereby establishing a richer clinical context valuable for downstream analysis.
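Timeframe unification can be sketched minimally as follows, assuming each event is a dictionary with a `time` field (field names are illustrative): the two streams are tagged, merged, sorted, and anchored to a clinical milestone such as admission:

```python
from datetime import datetime

def unify_timeline(patient_events, metadata_events, anchor):
    """Merge patient-data events and EHR-use-metadata events onto one
    timeline and anchor each to a clinical milestone (e.g., admission).
    Events are dicts carrying at least a `time` timestamp."""
    merged = ([dict(e, stream="patient") for e in patient_events] +
              [dict(e, stream="metadata") for e in metadata_events])
    merged.sort(key=lambda e: e["time"])
    for e in merged:
        # Express every event as hours relative to the clinical anchor.
        e["hours_from_anchor"] = (e["time"] - anchor).total_seconds() / 3600
    return merged
```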

Second, feature scope determination defines the range and resolution of the features selected for model development to ensure they are relevant and reflective of the clinical context. Deciding whether to incorporate the complete sequences of raw event data, filter for a specific subset, or apply an appropriate level of aggregation needs careful assessment. Not all raw clinical data points or clinician actions are equally meaningful for a specific downstream analysis. Domain experts may identify those that are essential to a specific modeling objective, such that key signals can be isolated from noise. For example, in a real-time patient deterioration surveillance system in acute or intensive care settings, Rossetti et al. use expert-determined features, including the frequencies of note writing, vital sign measurements and comments, medication administration, and a specific set of pertinent symptom terms from nursing notes, all of which are deemed to have strong predictive power for deterioration events22,27. On the other hand, by mapping specific clinician action sequences to known clinical tasks or care processes, it is possible to convert fine-grained event data into aggregated features with updated timestamps that encapsulate broader clinical workflows56. The analytical value of these features can be further enhanced by incorporating additional contextual metadata, such as information about care team composition and roles69, staffing changes70, shift handoffs71, or even patient-provider communications through the portal72,73. These contextual elements enrich the feature set by providing insights into the organizational and interpersonal dynamics that might influence patient outcomes and clinician decision-making. For instance, changes in staffing levels or shift transitions might correlate with certain care processes, while portal logs might reveal early indicators of patient concerns or adherence issues.

Third, harmonizing data granularity is critical for creating a unified feature space from data recorded at varying levels of detail. Harmonization reconciles differences across data sources to ensure that atomic events and aggregated data are consistently aligned. This can typically be achieved by time windowing, which aggregates data into fixed intervals (e.g., minutes, hours, or days) such that high-frequency events can be represented as summary statistics. For example, in early sepsis detection, high temporal resolution, e.g., every 10 minutes, is required to capture transient fluctuations in vital signs (e.g., mean, maximum, and minimum values) and to track dynamics in clinician actions in the EHR (e.g., adjustments to medication administration and vital sign monitoring patterns). This level of granularity ensures that subtle but critical changes are preserved for analysis. In contrast, for chronic disease progression applications such as managing chronic kidney disease, a coarser granularity, such as weekly or monthly, may be preferable to track metrics like glomerular filtration rate, creatinine levels, and the frequency of medication regimen adjustments, chart reviews, and note documentation by specialty. This level of harmonization helps smooth out short-term noise and emphasizes longer-term trends, ultimately supporting the recognition of overarching patterns that inform more reliable outcome forecasting.
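The time-windowing strategy can be sketched as follows, assuming numeric events (e.g., vitals) carry a `value` field while clinician actions do not; the window size is a tunable parameter chosen per application, as discussed above:

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

def windowed_features(events, anchor, window_hours=1):
    """Aggregate a mixed event stream into fixed time windows: numeric
    events (carrying `value`) become mean/min/max summaries, and clinician
    actions become counts. Field names are illustrative assumptions."""
    bins = defaultdict(lambda: {"values": [], "actions": 0})
    for e in events:
        # Window index relative to the anchor (integer division on seconds).
        idx = int((e["time"] - anchor).total_seconds() // (3600 * window_hours))
        if "value" in e:
            bins[idx]["values"].append(e["value"])
        else:
            bins[idx]["actions"] += 1
    return {idx: {"action_count": b["actions"],
                  "value_mean": mean(b["values"]) if b["values"] else None,
                  "value_min": min(b["values"]) if b["values"] else None,
                  "value_max": max(b["values"]) if b["values"] else None}
            for idx, b in bins.items()}
```

For sepsis-style use cases, `window_hours` would shrink to fractions of an hour; for chronic disease monitoring, it would stretch to a week or month.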

Fourth, validation of data alignment needs to be performed to ensure that the unified temporal framework accurately reflects the sequential clinical processes in real-world settings. In addition to confirming the correct sequencing of events—particularly those aggregated from atomic clinician EHR actions—using their timestamps, this process can involve cross-referencing key clinical milestones, such as admission or discharge times, diagnostic tests, and therapeutic interventions. Integrating expert clinical review alongside automated, rule-based sanity checks can help refine the alignment by embedding human insights into typical workflow patterns and the causal structures between patient physiological states and clinician actions.

Training task-specific clinical AI models

Upon the completion of data alignment, a wide range of AI-assisted clinical tasks can benefit from a dual-perspective approach that integrates conventional patient data with enriched insights from EHR use metadata. This approach has the potential to transform diverse applications (Table 1) in a way that enables the modeling of the complex relationships (Fig. 1c): (1) between patient data and EHR use metadata, (2) between EHR use metadata and the target outcome, (3) between the interaction in (1) and the target outcome, and (4) within EHR use metadata, while preserving the traditional associations captured between patient data and the target outcome. Several recent clinical AI studies, aligned with the design principles of DL-ClinAI, have demonstrated substantial benefits in both performance and reliability. In a critical patient-level prediction task (i.e., daily hospital discharge), Zhang et al. enhance a tree-based machine learning model by integrating counts of distinct action types in the EHR performed by care team members during the past 24 hours, alongside conventional features such as daily updated patient data and day of week23. This integration significantly improves the area under the receiver operating characteristic curve from 0.86 to 0.92. The key takeaway is that EHR use metadata encode granular semantics about clinical workflows and implicitly convey clinicians’ evolving assessments of patient discharge status. Interestingly, the most predictive feature for next-day non-discharge is a high frequency of medical device barcode scanning—a proxy for ongoing treatment activity, which intuitively indicates that the patient is not yet ready for discharge. In the context of clinical outcome forecasting, Bhaskhar et al. demonstrate that integrating clinician actions from EHR use metadata with structured EHR information significantly improves the prediction performance of major adverse kidney events within 120 days of ICU admission in patients with acute kidney injury, as well as 30-day readmission in acute stroke patients24. Notably, this approach proves substantially more robust to temporal data distribution shifts, a common challenge in healthcare data that often undermines the reliability of clinical AI applications. These findings suggest that clinician actions in the EHR can enrich the contextual understanding of care and serve as a stabilizing factor that anchors model predictions in real-time clinical judgment and workflow dynamics. Another notable example is a 1-year, cluster-randomized clinical trial of 60,000 hospital encounters across two institutions, where an early warning surveillance system for patient deterioration, powered by EHR use metadata, significantly reduced in-hospital mortality (−35.6%), length of stay (−11.2%), and sepsis risk (−7.5%) compared to the control arm21. Such solid evidence underscores the value of integrating EHR use metadata into clinical AI to deliver real-time, context-aware insights that guide clinician actions and support timely interventions.

Table 1 Examples of EHR-based clinical tasks DL-ClinAI can support

The selection of models for training must be aligned with the specific clinical task. Because real-world clinical AI applications are executed on a recurring or continuous basis as new patient data and clinician actions accumulate in the EHR, the selected models must be able to use the most recent information that could update the model’s beliefs about its target training objective. In practice, this means prioritizing models that support rapid updating or incremental learning, offer mechanisms for data drift detection, and maintain calibration as the underlying data distribution evolves74. Longitudinal models that can handle multivariate sequences, such as transformer-based encoders, temporal convolutional networks, and recurrent neural networks, are particularly well-suited, because they can ingest the timeline-aligned data and continuously refine their internal representations to conduct predictions or classifications75. When events far in the past add little value to a prediction or classification, and the exact ordering of recent events is not critical, simpler models, such as shallow feed-forward networks and tree-based methods, may be preferable76. These models often achieve strong performance without the computational overhead and infrastructure demands of more complex sequence-handling architectures.

Training clinical AI foundation models

Developing clinical foundation models from large-scale patient health records has recently gained considerable interest77,78,79,80,81. The core rationale is straightforward: instead of collecting a bespoke dataset and training a separate model for every individual clinical task, one can pretrain a single, high-capacity foundation model using a unified, longitudinal clinical dataset and then adapt it to a broad range of downstream tasks. Pretraining enables the model to learn the semantics and latent structures of patient health trajectories by solving self-supervised objectives, such as predicting the next clinical event (e.g., diagnosis codes or laboratory test values) or reconstructing masked events along the timeline82. These objectives require no manual labels. Once pretrained, the foundation model can be quickly specialized through lightweight adaptation to support diverse clinical tasks across multiple domains80. This mechanism not only dramatically reduces the burden of data curation and model development but also promotes knowledge transfer to low-resource settings, improving both scalability and generalizability of clinical AI.

The DL-ClinAI framework naturally extends to the development of dual-lens clinical foundation models, which can be pretrained on timeline-aligned patient data and EHR use metadata. This integrated pretraining enables the model to learn not only the progression of patient health states over time but also how clinicians act upon those evolving states within real-world clinical workflows. In this setting, self-supervised objectives can be applied to alternate between predicting clinician actions and patient-specific clinical events. Self-supervised contrastive learning can also be performed to align the patient-data stream with the corresponding clinician-action stream for the same patient and clinical context. By modeling this interdependency, the foundation model is expected to establish a rich and contextual representation of the interplay between patient health states and clinical decision-making. These patient representations are particularly valuable for downstream tasks that require sensitivity to workflow dynamics, practice variability, or team-based decision patterns, where patient data-based foundation models fall short. Importantly, the dual-lens foundation model offers a unique advantage: it can easily simulate clinician behavior sequences between clinical events, effectively filling contextual gaps where traditional models lack information. For example, it can infer whether a change in treatment was preceded by increased monitoring, a specialist consultation, or documentation activity, all invisible in patient data but critical for understanding the full care context. Additionally, this dual-lens foundation model can support developing a stronger clinical digital twin83—a virtual, individualized representation of a patient’s physiological state over time that allows dynamic simulation of potential treatment strategy, monitoring and prediction of health trajectory, and early intervention and prevention, based on modeling of multi-modal patient data. 
The digital twin, powered by the dual-lens foundation model, can be leveraged to simulate the downstream effects of alternative care strategies, clinician responses, and EHR workflow configurations, enabling direct analysis of how different clinician behaviors and operational patterns influence patient trajectories and outcomes. Specifically, such a digital twin can generate counterfactual scenarios to predict how a patient’s outcome might change under earlier monitoring, delayed documentation, or different triage strategies, such that best practices, inefficiencies, and even medical errors can be identified in a data-driven, low-risk environment.

Impact assessment

Beyond evaluating the accuracy of the core algorithm of a clinical AI tool, the real-world impact of such a tool hinges on numerous factors that researchers often underappreciate, including, but not limited to, interface design, the level of integration into existing workflows, alert timing and volume, clinician training and trust, ongoing performance monitoring and recalibration, data quality safeguards, interoperability with other systems, and governance structures for oversight and accountability84,85. Neglecting any of these factors can undermine even the most accurate model, leading to clinician frustration, workflow disruption, or patient safety risks that erode the tool’s intended benefits. DL-ClinAI underscores that the assessment of clinical AI tools must consider two synchronized streams of evidence: objective patient data and the granular behavioral traces clinicians leave in the EHR. Patients’ short-term and long-term outcomes reveal whether a clinical AI tool actually benefits care, while EHR use metadata show how the tool changes workload, decision pathways, and potential automation bias, along with other factors linked to changes in outcomes for both patients and clinicians. Only by examining these two strands of evidence in tandem can health systems determine whether innovations such as AI scribes, early warning systems, or draft-reply assistants truly help or simply shift workload, introduce new errors, or erode clinicians’ skills.

EHR use metadata have clear strengths for assessing the post-deployment impact of clinical AI across several complementary dimensions28,29,86,87. First, timestamped clicks, scrolls, keystrokes, and order events make it possible to trace what clinicians do immediately after an AI-generated alert or suggestion appears (e.g., how long until an order is placed, whether an imaging study is ordered, or which EHR function or section they navigate to). These action sequences reveal whether a tool is actually used by clinicians and whether it truly streamlines care pathways or unintentionally inserts detours and delays. Second, traditional aggregate metrics like total time in the chart, after-hours click counts, or the number of simultaneously open patient charts can be used to quantify how the tool shifts effort and attention. Increases or decreases in these metrics may serve as proxies for cognitive load and burnout risk following clinical AI deployment. Third, simple but powerful indicators, such as the proportion of AI-generated text that clinicians sign without modification, the keystrokes or edits required to revise drafted replies before sending, the frequency of same-day note revisions, or the note-to-order latency, may surface automation bias, latent errors, and potential patient safety threats that would otherwise remain hidden. Additionally, capturing and analyzing what is edited (or canceled) and why can directly help refine the algorithm that leads to the modifications. Fourth, longitudinal trends in manual order entry, free-text documentation length, template diversity, or resident-versus-attending contribution rates flag whether clinicians are maintaining core judgment and reasoning skills or drifting toward overreliance on AI assistance. Monitoring these patterns supports proactive retraining and safeguards clinical competence.
Fifth, by systematically quantifying and comparing how workload is redistributed across the entire care team in aspects such as time-in-system by role, branching complexity, length of hand-off chains, frequency of simultaneous chart access, and any new coordination bottlenecks after tool deployment, health systems can reveal how team collaboration is impacted.
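For instance, indicators such as the proportion of AI-generated drafts signed without modification can be computed directly from draft-reply logs. In the sketch below, the record fields are hypothetical assumptions, not an actual vendor schema:

```python
def automation_bias_indicators(drafts):
    """Compute illustrative post-deployment indicators from draft-reply
    logs. Each record is assumed to hold the AI draft, the text actually
    sent, and the keystrokes spent editing; field names are hypothetical."""
    if not drafts:
        return {"signed_unmodified_rate": 0.0, "mean_edit_keystrokes": 0.0}
    # Drafts sent verbatim may indicate automation bias worth auditing.
    unmodified = sum(1 for d in drafts if d["ai_text"] == d["sent_text"])
    return {
        "signed_unmodified_rate": unmodified / len(drafts),
        "mean_edit_keystrokes":
            sum(d["edit_keystrokes"] for d in drafts) / len(drafts),
    }
```

Trending these values over time, stratified by clinician role or specialty, is what turns them into actionable monitoring signals.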

Challenges and opportunities

Heterogeneity in vocabularies and resolution

As EHR vendors differ as fundamentally as operating systems, innovations that rely on EHR use metadata must deal with cross-vendor architectural differences. Indeed, EHR use metadata differ widely in both action vocabulary and resolution9,88,89. Each EHR system offers its own interface, feature set, and recommended workflows, resulting in a distinct catalog of atomic action types. A task that appears as a single, high-level event in one system might be recorded as a sequence of fine-grained clicks in another, while certain activities captured in one system may have no direct counterpart elsewhere. Further heterogeneity arises when medical institutions running the same vendor’s EHR system customize their local deployment and create site-specific action types. These variabilities complicate the creation of universal, meaningful metrics of EHR use, hamper cross-institutional benchmarking, and undermine the portability of clinical AI development. To fundamentally resolve these challenges, we advocate for (1) a vendor- and organization-agnostic action vocabulary, which creates a common semantic building-block layer, (2) shared logging specifications that mandate a minimum set of timestamped interaction elements, such that the essential information of each event is always captured and portable, and (3) robust mapping frameworks that link raw events to higher-level clinical workflow concepts. By combining a universal lexicon, an enforced logging standard, and robust mapping technology, the community can produce a durable foundation such that clinical AI models can be trained once and deployed broadly with appropriate adaptation. Standardization at these layers is also important for comparing clinician-EHR interaction patterns, validating AI tools across sites, and ultimately realizing scalable, trustworthy clinical AI deployments. 
Alternatively, this heterogeneity can be mitigated technologically by using or fine-tuning a domain-specific language embedding model to encode textual descriptions of action types into a shared latent space, such that semantically similar actions from different vendors or local customizations cluster together. This enables clinical AI models to reason over semantically equivalent actions without relying on vendor- or institution-specific vocabularies. Nevertheless, to account for potential dataset shift, any model trained at one institution should undergo formal external validation and, where needed, recalibration or fine-tuning before clinical use elsewhere.
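The intuition can be illustrated with a deliberately simple stand-in: a bag-of-words "embedding" and cosine similarity in place of a domain-specific language model. The action-type descriptions are invented; a real system would use a fine-tuned neural encoder, but the clustering principle is the same.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding; a stand-in for a domain-specific
    language model that would encode action-type descriptions."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Invented action-type descriptions from two hypothetical vendors
desc_a = "open patient progress note for review"
desc_b = "view progress note in patient chart"
desc_c = "sign and submit medication order"

# Semantically similar descriptions (desc_a, desc_b) score higher
# than unrelated ones (desc_a, desc_c), so they cluster together.
```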

Noise in data

Raw EHR use metadata are inherently noisy because they record every interaction, whether it is part of a clinical workflow, a training exercise, or an illicit access that breaches patient privacy90. Duplicated clicks and keystrokes, automatic pop-up notifications, system-initiated refreshes, asynchronous logging that produces partially recorded action sequences, and interleaved event streams from multitasking or shared workstations all add to the complexity of turning these traces into informative features and models. Machine-generated entries and workflow artifacts produced to satisfy process requirements or institutional policies complicate matters further. If left unfiltered, these artifacts can lead models to overfit to interface peculiarities rather than clinically relevant behaviors and can mask safety-critical actions. Mitigating this risk demands rigorous data preprocessing when building reliable clinical AI systems, such as removing duplicated events, filtering out sequences from irrelevant roles, expert-guided feature selection, aggregating low-level actions into higher-level tasks, and re-ordering out-of-order events or tasks, as well as post-hoc model explainability techniques that identify influential features or subsequences. Notably, identifying and separating machine-generated entries is feasible based on their characteristic patterns of occurrence.
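A minimal preprocessing sketch shows three of these steps—filtering machine-generated roles, collapsing consecutive duplicate actions, and aggregating low-level actions into task labels. The roles, action codes, and task vocabulary are all invented for illustration; real pipelines would use validated, site-specific rules.

```python
# Hypothetical raw events: (timestamp, role, action); all codes invented.
raw = [
    (1, "physician", "open_chart"),
    (2, "physician", "open_chart"),           # duplicated click
    (3, "interface_engine", "auto_refresh"),  # machine-generated entry
    (4, "physician", "view_labs"),
    (5, "physician", "view_note"),
]

MACHINE_ROLES = {"interface_engine"}
# Toy aggregation of low-level actions into higher-level task labels
TASK_OF = {"open_chart": "chart_review", "view_labs": "chart_review",
           "view_note": "chart_review", "sign_order": "order_entry"}

def preprocess(events):
    """Drop machine-generated roles, collapse consecutive duplicates,
    and aggregate low-level actions into task labels."""
    cleaned, prev = [], None
    for ts, role, action in sorted(events):  # re-order by timestamp
        if role in MACHINE_ROLES:
            continue
        if (role, action) == prev:
            continue  # consecutive duplicate of the same action
        prev = (role, action)
        cleaned.append((ts, role, TASK_OF.get(action, "other")))
    return cleaned
```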

Off-screen clinician behaviors

Despite the advantages of EHR use metadata, they record only the interactions that occur within the EHR, leaving out substantial dimensions of clinician behavior that unfold off-screen7,91. Many critical activities, such as team huddles, bedside hand-offs, informal hallway consultations, and direct patient-clinician communications, are invisible in event logs, yet they often drive decision pathways and care coordination. Equally elusive are the cognitive processes underpinning care (e.g., mental model updates and multidisciplinary deliberations), which may not be reliably inferred from click-stream data alone. To bridge this gap, the development and assessment of clinical AI tools must be supplemented with richer sources of clinician behavioral data, such as audio or video transcripts of interactions in the physical world, sensor outputs, and ambient AI data, so that these tools can draw on the full spectrum of clinician activity and decision making.

Potential bias

Integrating EHR use metadata into clinical AI development and evaluation at scale must safeguard against the propagation or amplification of pre-existing biases, ensuring that clinical decision-making remains fair and that research conclusions are not flawed. EHR use metadata mirror the realities of resource constraints, documentation norms, staffing hierarchies, vendor- and institution-specific biases, and other dimensions along which intentional or unintentional bias can arise during care. Previous research has shown, for example, that healthcare professionals engage with the EHR differently for patients from different demographic groups41. When such variations affect the thoroughness and precision of care, they can produce quality disparities, thereby biasing all downstream model development and analysis if ingested without critical evaluation. These issues, however, are not unique to EHR use metadata and have analogs in almost all clinical data, suggesting that existing mitigation strategies can be adapted as potential solutions. Guarding against these risks requires periodic, stratified checks of input data, carefully selected model training strategies, and continuous performance auditing across dimensions such as patient demographics, clinician roles, and care settings to detect and correct harmful biases. Based on emerging guidance for race-aware AI, practical safeguards can be applied across the lifecycle of DL-ClinAI92. At the data preprocessing stage, a “nutrition label”-style dataset summary can document essential information such as institutional and workflow context, clinician expertise composition, and data missingness. When imbalances are identified, approaches such as stratified sampling, targeted oversampling, and synthetic data augmentation can help restore representativeness. For model development, bias-aware loss functions, constraint-based optimization, and adversarial learning can be leveraged to promote model fairness.
Trained models must be rigorously evaluated on temporally and institutionally held-out datasets, with stakeholder feedback informing model retraining decisions when necessary. Finally, deployment safeguards should include silent-mode validation prior to launch, followed by model performance and fairness dashboards that monitor potential drift and trigger recalibration or retraining to ensure an ongoing feedback loop.
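The stratified auditing described above can be sketched as a per-subgroup performance check with a simple gap statistic that could trigger recalibration. The records, subgroup labels, and threshold semantics are hypothetical; real audits would cover multiple metrics (e.g., calibration and error rates) and multiple stratification dimensions.

```python
from collections import defaultdict

# Hypothetical model predictions with subgroup labels (all values invented).
records = [
    {"group": "A", "y_true": 1, "y_pred": 1},
    {"group": "A", "y_true": 0, "y_pred": 0},
    {"group": "A", "y_true": 1, "y_pred": 1},
    {"group": "B", "y_true": 1, "y_pred": 0},
    {"group": "B", "y_true": 0, "y_pred": 0},
]

def stratified_accuracy(records):
    """Accuracy per subgroup: one of the stratified checks described above."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["group"]] += 1
        hits[r["group"]] += int(r["y_true"] == r["y_pred"])
    return {g: hits[g] / totals[g] for g in totals}

def fairness_gap(per_group):
    """Largest pairwise accuracy difference; a candidate recalibration trigger
    when it exceeds a pre-specified tolerance on a monitoring dashboard."""
    vals = per_group.values()
    return max(vals) - min(vals)
```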

Shaping the future through shared innovation

Pairing the behavioral signals embedded in EHR use metadata with traditional patient data opens a transformative path toward clinical AI that is not only more accurate but also more context-aware, trustworthy, and tightly aligned with real-world patient care. By revealing how clinicians act on patient data, this dual-lens approach enables AI systems to fit seamlessly into workflows and reveal opportunities for improvement. However, this vision requires addressing multifaceted challenges through concerted effort. We encourage researchers, clinicians, policymakers, and industry partners to join forces in establishing vendor-agnostic data standards, building scalable preprocessing pipelines, and developing bias-sensitive evaluation frameworks that support the effective and fair utilization of EHR use metadata throughout the lifecycle of clinical AI. Through such collaboration, the healthcare community can shape an AI-enabled future where technology meaningfully augments clinical expertise and drives safer, more efficient, and more equitable healthcare delivery.