Introduction

Understanding how the brain gives rise to complex behaviour remains one of the central challenges in neuroscience. Although decades of research have elucidated the neural mechanisms underlying simple sensory or motor tasks, a mechanistic understanding of higher-order behaviours, such as decision-making, social interaction and cognitive flexibility, remains elusive. Progress in this domain is critically dependent on our ability to link brain activity with behaviour at appropriate levels of abstraction and resolution1,2,3. Joint brain–behaviour modelling has been a key methodological advance towards achieving that goal.

Recent years have seen major advances in both neural recording technologies and behavioural measurement tools4,5,6,7. On the neural side, large-scale electrophysiology, calcium imaging and neuromodulatory tagging enable the simultaneous recording of activity from hundreds to thousands of neurons across multiple brain regions5,8,9. On the behavioural side, high-resolution video, inertial sensors and pose estimation techniques have made it possible to capture fine-grained behavioural dynamics over time2,7,10,11,12,13. These parallel advances open the door to a deeper understanding of how distributed neural populations coordinate to drive complex behaviours, but only if they are integrated analytically.

Artificial intelligence (AI), which encompasses modern machine learning, deep learning and agent-based systems, has driven tremendous advances across many scientific applications, ranging from protein design14 to weather prediction15. Naturally, AI has also had a major impact on joint modelling approaches in neuroscience, which provide a statistical and computational framework to bridge neural and behavioural data. Rather than analysing each domain in isolation, joint models capture the shared structure between neural dynamics and behavioural outputs, enabling researchers to test hypotheses about how neural data are related to behaviour and vice versa (see refs. 16,17 for excellent probabilistic neural modelling reviews).

In this Review, we survey recent progress in joint modelling of neural and behavioural data, with a focus on methodological innovations, scientific and engineering motivations, and key areas for future innovation. We begin by giving some background on advances in AI that are relevant for understanding neural, behavioural and joint modelling approaches. We then survey the main optimization approaches relevant for joint modelling — discriminative, generative and contrastive — along with their limitations and advantages. We next discuss how these tools reveal the shared structure between neural activity and behaviour and how they can be used for both scientific and engineering aims. Then, we describe recent advances in behavioural analysis approaches, including hierarchical behaviour analysis, which could influence the next generation of joint models. Finally, we argue how considering not only the performance of models but also metrics of their trustworthiness and interpretability can help to advance the development of joint modelling approaches.

Principles of deep learning models

Fundamentally, the goal of AI models often amounts to solving challenging perception and decision-making problems. For instance, one needs to decide, based on a recorded audio signal, whether a rat is emitting an ultrasonic vocalization18 or, based on a video, whether the rat is performing spontaneous joy jumps (Freudensprünge)19. Experts can readily score such events, and it should be no surprise that AI systems are also increasingly capable of doing so. In broad strokes, these perception problems can now be solved with AI. Here, it is also worthwhile to remember that AI systems at times solve perception problems with algorithms that are at least loosely inspired by the brain20,21. In this section, we look more closely at how these AI systems achieve such perceptual capabilities — focusing on the machine learning and deep learning fundamentals that underlie their success. By briefly examining how these methods operate and differ, we can better appreciate both their power and their limitations for joint modelling of neural data and behaviour.

Machine learning systems consist of four key components that work together to solve problems: a data set, a model, a loss function and an optimization algorithm22,23,24. The data set defines the input–output relationships that the model should learn; for instance, for ultrasonic vocalization identification, the system must predict a binary output (no call versus call) from a particular audio waveform input. The model serves as the mathematical framework that transforms these inputs into outputs through adjustable internal parameters. The loss function measures the quality of the model’s predictions by comparing them with the ground truth data, providing a numerical score that assesses performance. Loss functions quantify prediction error and are closely related to objective functions — the general term for any function being optimized (whether minimized or maximized). Finally, the optimization algorithm iteratively updates the model’s parameters to minimize this loss, effectively steering the model towards better performance. The specific choices made about these four components directly influence both the possible performance and the robustness of the overall machine learning system. Technically, this is the definition of supervised learning systems when the data have labels, namely, where input–output pairs are given. We later discuss self-supervised learning, which learns from unlabelled data by creating supervisory signals from the data’s own structure. This self-supervised paradigm lies at the heart of innovations for joint brain–behaviour modelling.
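As a minimal sketch, these four components can be made concrete in a few lines of PyTorch; the random tensors stand in for labelled audio features, and all sizes and names are illustrative:

```python
import torch
from torch import nn

# 1) Data set: illustrative stand-in for labelled audio features
#    (x: feature vectors, y: 0 = no call, 1 = call)
x = torch.randn(256, 64)
y = torch.randint(0, 2, (256,))

# 2) Model: adjustable internal parameters map inputs to outputs
model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 2))

# 3) Loss function: scores predictions against the ground truth labels
loss_fn = nn.CrossEntropyLoss()

# 4) Optimization algorithm: iteratively updates parameters to reduce the loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()   # gradients of the loss with respect to all parameters
    optimizer.step()  # gradient-based parameter update
```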

Before the advent of deep learning, classic (supervised) machine learning used domain-specific feature engineering (via a fixed encoder) followed by trainable classification (via a decoder). For ultrasonic vocalization processing, raw waveforms could be transformed via auditory filter banks into statistical descriptors (akin to what the cochlea does). In this case, those filter banks are the encoder. They extract features from the raw waveforms and these features are fed into a classifier (decoder) to predict calls. Only the decoder is trained whereas the encoder remains fixed, reflecting historical constraints where domain knowledge in the encoder compensated for limited learning capacity in the decoder.
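A schematic sketch of such a fixed-encoder pipeline (random arrays stand in for real recordings, and the choice of spectrogram band statistics is purely illustrative):

```python
import numpy as np
from scipy.signal import spectrogram
from sklearn.linear_model import LogisticRegression

# Illustrative stand-ins: 100 short waveforms (0.1 s at 250 kHz) and binary call labels
waveforms = np.random.randn(100, 25_000)
labels = np.random.randint(0, 2, 100)

def encode(w, fs=250_000):
    """Fixed encoder: a spectrogram acts as a simple filter bank; summary
    statistics of each frequency band serve as hand-crafted features."""
    f, t, s = spectrogram(w, fs=fs)
    return np.concatenate([s.mean(axis=1), s.std(axis=1)])

features = np.stack([encode(w) for w in waveforms])

# Trainable decoder: only the classifier's parameters are fitted;
# the encoder above remains fixed
clf = LogisticRegression(max_iter=1000).fit(features, labels)
print(clf.score(features, labels))  # training accuracy on the toy data
```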

Deep learning revolutionized this approach by making both the encoder and the decoder (alternatively called the backbone and output heads) learnable components implemented as deep neural networks. These networks consist of multiple layers of differentiable, non-linear transformations that are optimized together. Unlike classic approaches that rely on handcrafted features, deep neural networks optimize the feature representation directly for the task at hand, learning which aspects of the input are most relevant22,23. Neural networks, particularly deep architectures, excel in extracting hierarchical features that progress from simple local patterns to complex global structures. Given sufficient training data, this end-to-end learning yields superior performance and robustness. Typical model architectures are multilayer perceptrons (MLPs) (Fig. 1a), convolutional neural networks (CNNs) (Fig. 1b), recurrent neural networks (RNNs) (Fig. 1c), transformers (Fig. 1d) or state-space models (Fig. 1e). Although these architectures differ in structure, they all function as universal approximators capable of learning complex mappings when provided with enough capacity (model size; the number of adjustable internal parameters) and data25. However, the choice of architecture for both the encoder and the decoder may substantially impact both data efficiency and final performance24.

Fig. 1: Common neural network architectures.

a, Multilayer perceptrons (MLPs) are neural networks composed of fully connected layers, where each neuron receives weighted input from every neuron in the preceding layer. This dense connectivity allows MLPs to learn complex non-linear mappings between inputs and outputs, although at the cost of a large number of parameters. b, Convolutional neural networks (CNNs) process grid-structured data such as images by applying learnable filters across spatial dimensions. In this toy example, three initial convolutions with weight sharing create feature maps, which are then downsampled via pooling to reduce spatial dimensions while retaining important features. Additional convolutions follow and, finally, fully connected layers predict outputs. CNNs exploit weight sharing and hierarchical feature extraction. In vision tasks, it is well known that they progressively build from edge detectors to complex object representations. c, Recurrent neural networks (RNNs) such as gated recurrent unit networks (GRUs) or long short-term memory networks (LSTMs) process sequential data by maintaining a hidden state h(t) that evolves over time t, passing information from one time step to the next via a recurrent connection that combines h(t − 1) and x(t). This recurrent connection acts as the network’s memory, allowing RNNs to capture temporal dependencies and patterns in sequential input data x(t) to create the output y(t). They can struggle with long-range temporal dependencies due to vanishing or exploding gradients. d, Transformers have revolutionized sequence modelling by replacing recurrent connections with self-attention mechanisms. These mechanisms compute relationships between all positions in a sequence simultaneously, enabling the capture of long-range temporal dependencies while maintaining computational parallelizability — a key advantage over sequential architectures. Transformers process input tokens (embeddings (emb)) combined with positional embeddings (pos enc) through layers containing multi-head attention, MLPs and skip connections. Skip connections (also called residual connections) bypass intermediate layers (here, attention and MLP). ‘Add and norm’ blocks implement these skip connections: ‘add’ sums the input with the layer output (residual connection) whereas ‘norm’ applies layer normalization, together improving gradient flow and training stability35. e, State-space models provide an alternative approach to sequence processing by modelling data as continuous-time dynamical systems, also with hidden states (h(t)). Through learned state transitions and efficient discretization schemes, state-space models can handle extremely long sequences with linear computational complexity as indicated by the state equations, making them particularly attractive for tasks requiring long-context understanding.

The loss function shapes what the model learns by defining success. In supervised learning, where there are labelled examples, the loss function typically measures the prediction error, such as using cross-entropy for classification tasks or the mean squared error (MSE) for regression (Box 1). The optimization process that minimizes the loss generally employs gradient-based methods24. Collectively, this framework of data set, model, loss and optimization provides a unified lens for understanding all types of machine learning systems, establishing the vocabulary that we use throughout our Review.

What is brain–behaviour modelling, and what is the goal?

Understanding how neural activity gives rise to hierarchical behaviour requires integrative modelling approaches that can extract structure from high-dimensional, heterogeneous data modalities. Overall, one is interested in modelling the joint distribution P(behaviour, neural data), which can be achieved in numerous different ways that generally fall into four classes. Decoding models study how behaviour depends on neural data, P(behaviour | neural data). Encoding models instead study how neural data depend on behaviour (or sensory input), P(neural data | behaviour). Latent models capture P(neural data) via self-supervised learning using generative or contrastive approaches, learning latent variables z that are then related to behavioural data. Latent variables are unobserved quantities that must be inferred from observed data and typically represent abstract features that capture underlying structure in high-dimensional observations. For example, whereas thermometers appear to measure it directly, temperature is fundamentally a latent variable — a statistical property of microscopic particle energy distributions that we infer from macroscopic observations. Similarly, in neural–behavioural modelling, latent variables represent aggregate properties of high-dimensional neural activity that we infer from observable spike trains and behaviour. Finally, joint models directly model the joint distribution of behaviour and neural data, P(behaviour, neural data). These different approaches immediately raise the question of which modelling approach is best suited for a given scientific or engineering goal.
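In compact notation, writing b for behaviour, n for neural data and z for latent variables, the four classes can be summarized as follows (the integral in the latent case is the marginalization over z that makes inference non-trivial):

```latex
\underbrace{P(b \mid n)}_{\text{decoding}}, \qquad
\underbrace{P(n \mid b)}_{\text{encoding}}, \qquad
\underbrace{P(n) = \int P(n \mid z)\, P(z)\, \mathrm{d}z}_{\text{latent}}, \qquad
\underbrace{P(b, n)}_{\text{joint}}
```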

From an engineering perspective, one may aim to build brain–machine interfaces (BMIs), where high performance in behavioural decoding and real-time execution are paramount. By contrast, a scientific goal may involve constructing a mechanistic model that captures the computational principles and dynamical processes underlying neural function — analogous to a digital twin in engineering, but focused on biological principles rather than exact replication. Such models should not only reproduce observed neural–behavioural relationships but also enable discovery of new principles through simulation, perturbation and hypothesis generation. Alternatively, the goal may be to test specific hypotheses about neural representations, necessitating interpretable latent variables that can be experimentally validated or falsified. A fourth objective involves exploratory discovery — using these methods to uncover novel patterns, cell types or computational motifs that were not previously known or hypothesized.

Each of these goals imposes different modelling requirements and evaluation criteria. Crucially, the answer cannot rely solely on decoding performance metrics such as spike prediction accuracy or behavioural reconstruction, as these metrics conflate fundamentally different objectives. A model achieving 99% decoding accuracy might use biologically implausible transformations that provide little insight into neural computation, making it excellent for BMI applications but unsuitable for mechanistic understanding. Therefore, articulating the scientific intent — whether engineering performance, mechanistic insight, hypothesis testing or open-ended discovery — is essential for guiding model selection, development and interpretation.

The diversity of scientific and engineering goals has naturally led to the development of these multiple modelling paradigms. Notably, whether one creates decoding, encoding, latent or joint models, broadly speaking there are three computational objectives (Fig. 2). Discriminative objectives for decoding are those that aim to predict behaviour (for example, spikes in, decode behaviour); however, we note that encoding models also use discriminative approaches and map from behaviour and/or stimuli to spikes3,26. Generative objectives for reconstruction are those that aim to reconstruct input data from learned latent representations (for example, spikes or behaviour in, predict spikes or behaviour). Contrastive objectives for encoding and joint modelling are those that aim to encode without reconstruction (for example, spikes and behaviour in, learn latents via representation learning). Although each of these three approaches can serve either a specific goal or multiple goals, they all flourish due to complementary trade-offs: discriminative models provide computational efficiency for targeted predictions; generative models enable sampling and uncertainty quantification; and contrastive methods leverage unlabelled data to discover representations that generalize across contexts. In the following sections, we examine each paradigm in detail, highlighting representative methods and their applications to neural–behavioural data.

Fig. 2: Three broad classes of neural–behavioural dynamics models.

For all approaches, encoders and decoders can comprise different architectures (Fig. 1) and these learned models (encoders or decoders) can then be leveraged in downstream tasks. Although we focus on spikes as the primary input for illustration, other types of neural recordings (such as calcium imaging, local field potentials or functional MRI) can also be used with these approaches. a, Discriminative approaches use input spikes to decode behaviour using supervised losses such as the mean squared error (MSE) for continuous variables and cross-entropy for discrete variables. During training, predicted behaviour is compared with ground truth behaviour to compute the loss and update model parameters. At inference time, the trained model outputs predicted behaviour without requiring ground truth. b, In generative approaches, a model comprising an encoder and a decoder learns to generate spike data from latents. The encoder maps input spikes to latent representations, whereas the decoder reconstructs spikes from these latent codes. Reconstruction losses such as the negative log likelihood (NLL) or MSE compare predicted with ground truth spikes. In variational autoencoders (VAEs), training optimizes the evidence lower bound, which combines the reconstruction loss with regularization on the latent distribution (Box 1). c, Contrastive approaches use input spikes, optionally with auxiliary variables (such as behaviour labels), to learn latent representations through contrastive learning. This approach achieves representation learning without explicit reconstruction (decoding). Namely, learning is based on attraction and repulsion dynamics: similar samples (positive pairs) are pulled together whereas dissimilar samples (negative pairs) are pushed apart in the latent representation. CNN, convolutional neural network; InfoNCE, information noise-contrastive estimation; NCE, noise-contrastive estimation; RNN, recurrent neural network.

Discriminative models: directly decoding behaviour from neural data

Decoding is a long-standing task in neuroscience, beginning with classical approaches such as population vectors and Kalman filters (reviewed elsewhere3). These methods established the basic framework for mapping high-dimensional neural activity to low-dimensional behavioural variables, which allows both for understanding the information present in a population of neurons and for engineering BMIs. As the field progressed, machine learning techniques such as support vector machines27 and decision trees were adopted to improve decoding performance. Today, in terms of performance, these have largely been superseded by deep learning models, including transformer-based architectures28,29 (Fig. 1d).

These modern decoding models are typically supervised, using behaviour directly as the target in the loss function, most often with the MSE (Box 1). Recent benchmarking efforts30,31 have formalized this, focusing on the accuracy of behavioural decoding (and the prediction of spikes) (see the section ‘Generative models: learning to predict spike trains via reconstruction’) as the key measure of success. Note that ‘behaviour’ is typically a discrete or continuous 2D variable, such as the velocity of the hand or 2D position of a cursor on a screen, but in the following we discuss how new approaches to measuring behaviour could change the nature of this decoding goal (see the section ‘Behavioural analysis for neuroscience’).
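A minimal sketch of such a supervised decoder, mapping binned spike counts to 2D velocity with an MSE loss (random tensors stand in for real data; the GRU and its sizes are illustrative choices rather than a recommended architecture):

```python
import torch
from torch import nn

# Illustrative stand-ins: 64 trials, 100 time bins, 128 neurons -> 2D hand velocity
spikes = torch.randn(64, 100, 128)
velocity = torch.randn(64, 100, 2)

class VelocityDecoder(nn.Module):
    def __init__(self, n_neurons=128, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(n_neurons, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)  # output head: 2D velocity per time bin

    def forward(self, x):
        h, _ = self.rnn(x)
        return self.head(h)

model = VelocityDecoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(50):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(spikes), velocity)  # behaviour is the target
    loss.backward()
    optimizer.step()
```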

Indeed, transformer architectures are making impressive gains for decoding28,29,32,33. Their ability to flexibly model long-range dependencies and multimodal inputs has enabled state-of-the-art performance in behavioural decoding in comparison with supervised MLPs and RNNs (Fig. 1). One key advantage of transformers is their scalability: their architecture enables parallel computation, efficient use of large data sets and improved performance with increasing model size. Although attention operations are computationally expensive, self-attention can flexibly integrate contextual cues such as trial structure, sensory stimuli or task rules (Figs. 1d and 2a). This makes transformers especially suitable for data sets with complex temporal structure. Notably, tokenizing the spikes and leveraging positional embeddings make it more feasible to combine multi-session, multi-animal data. Newer scalable transformers such as Perceiver I/O offer greater flexibility and predictive power34. This enables fine-tuning and generalization to held-out data sets, paving the way for better foundation models29 (Box 2).

Yet there are also clear trade-offs in the complexity and speed of using large transformer models35,36, which limit deployment on devices37; practically speaking, the weeks of compute required for training29 can make this approach unviable for many laboratories. Therefore, although many powerful new approaches have been proposed in terms of decoding performance, there are ongoing efforts to build lighter-weight unified models that perform equally well even with smaller RNNs or MLPs38. For example, Sani et al.38 developed powerful lightweight models to extract task-relevant and task-irrelevant latent dynamics.

Generative models: learning to predict spike trains via reconstruction

Reconstruction-based approaches represent a powerful paradigm for learning latent representations of neural data without requiring labelled examples. Variational autoencoders (VAEs) are particularly well suited for neural data analysis because they learn probabilistic mappings between high-dimensional observations (x) and a (typically lower-dimensional) latent variable (z)39,40. A VAE consists of an encoder qϕ(z|x) (recognition model) that approximates the true but intractable posterior distribution (the probability of latent variables given observed data, p(z|x)), and a decoder (pθ(x|z)) that reconstructs the observations (data) from these latents by optimizing the data likelihood (Fig. 2b and Box 1). Unlike deterministic autoencoders, which map each input to a single point in latent space, VAEs learn probability distributions over latent representations, and thus model uncertainty in both the latent variables and the reconstruction process. This makes them especially valuable for capturing the inherent variability of neural data. We emphasize that VAEs are generative models, and the encoder is both a technical solution to learn the generative model and also a way to infer latent variables from data. VAEs are used in both of these ways in the literature (for example, see latent factor analysis via dynamical systems (LFADS) below). Importantly, the generative nature of VAEs also enables sampling novel neural patterns and quantifying uncertainty in latent variables (also called latent representations), which is essential for understanding the probabilistic structure underlying neural population activity.
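A minimal sketch of a VAE for binned spike counts, assuming a Poisson observation model (a common but, as discussed below, imperfect choice); the single-layer encoder and decoder are illustrative simplifications:

```python
import torch
from torch import nn

class SpikeVAE(nn.Module):
    def __init__(self, n_neurons=128, n_latents=8):
        super().__init__()
        self.enc = nn.Linear(n_neurons, 2 * n_latents)  # encoder q_phi(z|x): mean and log-variance
        self.dec = nn.Linear(n_latents, n_neurons)      # decoder p_theta(x|z): log firing rates

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        return self.dec(z), mu, logvar

def elbo_loss(x, log_rates, mu, logvar):
    # Poisson negative log likelihood (reconstruction term)
    nll = nn.functional.poisson_nll_loss(log_rates, x, log_input=True)
    # KL divergence between q_phi(z|x) and a standard normal prior (regularization term)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return nll + kl

model = SpikeVAE()
x = torch.poisson(torch.rand(64, 128) * 5)  # illustrative spike counts
log_rates, mu, logvar = model(x)
loss = elbo_loss(x, log_rates, mu, logvar)  # negative evidence lower bound
```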

In neuroscience, LFADS pioneered the application of VAEs by combining them with RNNs to model neural activity as a dynamical system41,42. LFADS can infer both trial-specific latent trajectories and putative inputs to the neural dynamics one is modelling. These learned latents and these input dynamics can then be related to behavioural and other experimental variables. To give some concrete examples, the learned representations have proven effective in decoding primate hand movements from the motor cortex and in detecting perturbations41. One can also learn models of neural dynamics across multiple experimental sessions (stitching) and use the generative nature of LFADS for sampling synthetic data41,42.

Whereas LFADS assumes continuous latent dynamics, switching linear dynamical systems (SLDS) take a different approach by modelling neural activity as governed by discrete state transitions. SLDS extend traditional state-space frameworks (Fig. 1e) by allowing the system to transition between multiple latent dynamical regimes over time. In neuroscience, these models have been used to flexibly capture non-stationary neural population dynamics. By inferring a sequence of discrete states from neural data, with each regime governed by distinct dynamics, SLDS models can reveal behaviourally relevant brain state switches, cognitive modes or neural circuit configurations43,44,45,46. For data from multiple individuals, it can be important to consider families of dynamical systems that share some parameters across individuals, such as multi-task dynamical systems47. In general, their strength lies in the interpretability of the latent states, which can demarcate transitions in neural dynamics. For example, one could use the resulting model to predict a context change or behavioural action switch from neural dynamics.
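To make the discrete-regime idea concrete, one standard SLDS formulation (observation models vary across studies, for example a Poisson link for spike counts) couples a Markov chain over discrete states s_t with state-dependent linear dynamics for the continuous latents x_t that drive the observations y_t:

```latex
s_t \in \{1, \dots, K\}, \qquad P(s_t = j \mid s_{t-1} = i) = \Pi_{ij},\\
x_t = A_{s_t} x_{t-1} + b_{s_t} + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, Q_{s_t}),\\
y_t \mid x_t \sim \mathcal{N}(C x_t + d,\; R).
```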

The reconstruction-based approach that defines VAEs is both their strength and their fundamental limitation. These methods optimize in raw data space, where natural metrics (such as pixel distance or Poisson loss) may not capture meaningful similarity in the underlying (latent) structure48. For instance, neurons deviate from Poisson statistics but are commonly modelled in this way. The reconstruction requirement forces a trade-off between capturing input fidelity and learning task-relevant latent representations: capacity spent on high-fidelity reconstruction may not be available for capturing task-relevant latent structure. There is no guarantee that minimizing the reconstruction error will yield representations that are optimal for understanding neural–behavioural relationships; this misalignment is a key challenge, as the natural metric for reconstruction may not align with the meaningful structure in the data. The reconstruction challenge is evident in vision applications, where standard VAE objectives often produce blurry reconstructions — the model optimizes what one measures (pixel similarity) rather than what one cares about (perceptual quality). This motivated the development of more sophisticated generative approaches such as diffusion models3,49. Recent work also leverages diffusion models and state-space models (Fig. 1e) to more realistically generate neural activity50. Another important limitation is that VAEs do not produce consistent results across training runs (Box 1).

Contrastive models: learning latents without reconstructing data

Contrastive learning sidesteps the reconstruction dilemma. Instead of asking ‘how do we generate this neural pattern?’, contrastive methods ask ‘what makes this neural pattern similar to or different from other patterns?’. This reframing eliminates the need to specify spike-level (Poisson loss) or pixel-level (pixel distance) similarity metrics, allowing the model to focus on discovering relationships native to the data51,52,53,54,55. Contrastive learning learns latent representations by maximizing agreement between related samples (positive pairs) while minimizing agreement between unrelated samples (negative pairs), without requiring supervised (behavioural) labels or input reconstruction (such as spike reconstruction) (Fig. 2c). Crucially, such models avoid imposing strong generative assumptions or supervised targets, which may bias or constrain the learned representation.

A method called CEBRA has pioneered this approach for continuous and discrete time-series data, particularly neural data54,55. CEBRA operates by pulling positive pairs closer together in latent space while pushing negative pairs apart, typically using objectives such as information noise-contrastive estimation (InfoNCE)51,56 (Box 2). For neural data, temporal proximity can serve as a natural basis for defining positive and negative pairs — neural patterns occurring within short time windows are treated as related (positive), whereas patterns separated by longer intervals serve as unrelated (negative) samples54. This self-supervised objective promotes embeddings that reflect the intrinsic temporal structure of neural data, capturing cognitive states and behavioural dynamics without requiring explicit labels or a supervised loss function. A core flexibility of the contrastive approach lies in how the positive and negative pairs are defined. A current limitation is that the time window is a tunable parameter but is restricted to a single timescale; future efforts should allow for hierarchical time bins. Importantly, this approach can naturally be extended to joint modelling.
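A minimal sketch of the InfoNCE objective used in such settings (tensors are illustrative; in practice, positive and negative pairs would be sampled by temporal proximity as described above):

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, temperature=1.0):
    """InfoNCE loss. anchor, positive: (B, D); negatives: (B, N, D)."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos_sim = (anchor * positive).sum(-1, keepdim=True)          # (B, 1) similarity to positive
    neg_sim = torch.einsum('bd,bnd->bn', anchor, negatives)      # (B, N) similarity to negatives
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature  # (B, 1 + N)
    labels = torch.zeros(len(anchor), dtype=torch.long)          # positive sits at index 0
    # cross-entropy pulls the positive pair together and pushes negatives apart
    return F.cross_entropy(logits, labels)

# Illustrative usage: batch of 32 anchors, 16 negatives each, 8-dim latents
anchor, positive = torch.randn(32, 8), torch.randn(32, 8)
negatives = torch.randn(32, 16, 8)
loss = info_nce(anchor, positive, negatives, temperature=0.1)
```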

Joint models for inferring latent dynamics via representation learning

Indeed, in addition to a purely time-aware contrastive loss, CEBRA54 can combine temporal structure with auxiliary variables (labels) to guide which neural samples are attracted together in the latent space. This use of labels allows for joint modelling of behavioural–neural data in a hypothesis-guided manner. Because the positive pairs can be crafted from the auxiliary variables (for example, behaviour), this explicitly allows for testing which behavioural labels extract meaningful latents from the neural space. For instance, if ‘space’ is hypothesized to be encoded in a given neural population, close spatial distances of the animal can be used to sample the positive pairs, and far spatial distances the negative pairs. If this relationship between space and the neural data does not exist, Schneider et al. showed both empirically and theoretically that this creates an unsolvable optimization problem — the model cannot simultaneously satisfy the contrastive constraints — and the embedding collapses to a trivial solution on the hypersphere (a diffuse cloud distributed on it)54,55. Note that auxiliary variables can also be derived from other modalities such as video embeddings54.

Such joint modelling with contrastive learning generalized theory from non-linear independent component analysis to ensure identifiability of the model (Box 1). Specifically, if two models f and f̃ trained on the same data yield the same conditional distributions over sample pairs (for example, via the InfoNCE loss) (Box 1), then their embeddings are linearly related — that is, a transformation L exists such that f̃(x) = Lf(x) for all x in the data set. This identifiability ensures that downstream tasks relying on these embeddings, such as decoding or topological data analysis, will behave consistently across (different) model instantiations, and it facilitates cross-participant or cross-modality alignment. As it only requires that latent variables vary sufficiently over time, CEBRA provides a flexible framework for analysing complex neural data (whether spikes or imaging) or behaviour, and can recover latent trajectories aligned with meaningful experimental variables under mild assumptions.

New work has extended this framework to include explicit temporal dynamics priors. Dynamic contrastive learning incorporates explicit modelling of SLDS to extract hypothesis-guided dynamical systems from neural data57. MARBLE also leverages contrastive learning, but first preprocesses the neural activity into manifold embeddings via geometric deep learning58. By doing so, it implicitly defines similarity through the similarity of spiking patterns over time. The limitation is that enforcing a specific geometry may restrict flexibility in capturing latent neural dynamics that do not conform to the assumed manifold structure.

Another key feature of these approaches is the identifiability of the models (Box 1). As we discussed in this section, contrastive learning with auxiliary variables can uniquely recover models when networks are bijective under noise-contrastive estimation (NCE) loss, and with InfoNCE loss the bijectivity assumption is sometimes unnecessary53,54,55,57. Notably, identifiability can also be achieved with VAEs (under more specific generative model assumptions). For the relevant neuroscientific case of Poisson noise, this was carried out in PI-VAE59. PI-VAE built on advances in identifiable VAEs60 (Box 1) to develop a method that outperformed LFADS, VAEs and pfLDS44 in predicting the latent variables in the underlying data, namely the position of a rat navigating on a linear track. Follow-up work extended this to better incorporate temporal information via CNNs (Fig. 1b), with conv-PI-VAE having even higher performance54.

Behavioural analysis for neuroscience

Lightly adapting Lord Kelvin’s dictum, one may quip that ‘what you cannot measure, you cannot understand’. Consider a natural scene where various species engage in their daily activities (Fig. 3a). With our advanced primate sensory and cognitive systems, we can effortlessly extract rich semantic information from this environment: identifying the different species, characterizing the sounds they produce, interpreting their behaviour, and even detecting nuanced social dynamics such as the attentive gaze of a mother monitoring her young. As we outline below, current behavioural analysis systems are comparatively limited.

Fig. 3: Hierarchical behavioural analysis.

In natural scenes where various species engage in their daily activities, current analysis systems are comparatively limited in transforming animals’ behaviour into rich, structured data streams that enable straightforward enquiry through simple human-interpretable queries. a, Problem setting and solutions: localization, pose, action understanding, re-identification148 and scene-level annotations. b, Hierarchical decomposition of behaviour in a mouse, spanning three levels: activities, actions and motion primitives. At the highest level, the mouse performs three activities: self-care, Freudensprung (joy jump) and social interaction. Each activity comprises multiple actions — self-care, for instance, involves grooming and sitting upright. These actions further break down into elementary motion primitives that constitute the building blocks of movement. Drawing in part a by Julia Kuhl.

Behaviour is inherently hierarchical, comprising nested sub-routines61,62,63,64, and often is not clearly discrete but, rather, continuous in nature (Fig. 3b). For example, the social dynamics of a mouse colony are characterized by many part-to-whole relationships across both space (from the entire colony, to individual family units, to specific mice, to their whiskers and forelimbs) and time (from seasonal reproductive cycles, to brief courtship interactions, to momentary investigative sniffs).

Ultimately, we believe that behavioural analysis systems should aim to capture this comprehensive and continuous behavioural landscape (Fig. 3b). We advocate that the goal is to transform animals and their environment (or ecosystem) into rich, structured data streams that enable straightforward enquiry through simple human-interpretable queries. Just as a video game designer has perfect knowledge of what virtual agents perceive and how they respond to their environment, we should strive for similar insight into animal behaviour in experimental contexts.

Many of the variables we seek to measure can be inferred well from cameras (sometimes other modalities are more appropriate, but the deep learning methods work similarly) (see the section ‘Towards hybrid objectives and multimodal modelling’ for discussion of multimodality). One of the foundational (machine learning) tasks is animal detection (localization). This can be done by training detectors65,66,67, which infer bounding boxes around each individual, or by simple computer vision transformations; the latter work well when the contrast is high68,69. One can also jointly estimate the location of multiple body parts, rather than just infer the body’s centre or the bounding boxes. Such pose estimation algorithms distil the geometric configuration of the animal’s body into a few user-defined keypoints70. With these methods, the locations of other objects or individuals can also be inferred, thus enabling the study of how animals interact with their environment. Pose estimation is mature: widely used tools are openly available71,72,73,74,75,76 and users can improve the performance of their tailored networks by adapting the augmentation pipeline70, using post-processing or using specialized methods for crowded scenes77 (reviewed elsewhere2,7,70).

Although these tailored, specialist models extract pose within user-defined contexts, recent unified models provide keypoint spaces that work robustly across species and settings with strong zero-shot performance75, or serve as stronger initializations than standard transfer learning71 when training is necessary. Similarly, for animal detection, MegaDetector78 or Segment Anything66,67 excel at localizing and segmenting animals across videos without annotation.

Moving beyond 2D estimation, users may want to extract kinematically accurate estimates in three dimensions and even merge these with biomechanical modelling. 3D pose estimation is (typically) achieved through multiple calibrated cameras72,79,80,81,82,83, depth cameras84,85,86 or a single camera87,88,89. From a single camera, one applies lifting methods, either directly from 2D pose sequences87,88,89 or with end-to-end trainable pipelines that combine multiple steps and can achieve excellent results even for complex cases such as hand–object interactions90,91. We discuss new avenues for merging 3D pose and biomechanics below (see the section ‘Towards hybrid objectives and multimodal modelling’).

After 2D or 3D pose extraction and tracking across time, activities, actions and motion primitives (Fig. 3b) — behaviours — are identified using three approaches: rule-based, supervised and unsupervised. Rule-based analysis defines behaviours through measurements — for instance, tracking head versus body keypoints enables defining heading angle and ‘look right’ behaviours, whereas tracking two mice allows defining ‘following’ heuristics. This simple yet powerful approach is widely implemented (for example, Live Mouse Tracker)92. Large language models can help researchers to write such rule-based analysis code93 (Box 2).
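A toy sketch of such rule-based analysis on pose-tracking output (random arrays stand in for tracked keypoints; the thresholds, and the convention that ‘right’ is the arena’s +x direction, are arbitrary illustrative choices):

```python
import numpy as np

# Illustrative stand-ins: (T, 2) arrays of tracked nose and body-centre keypoints
T = 1000
nose = np.random.rand(T, 2)
body = np.random.rand(T, 2)

# Heading angle: direction of the body-centre -> nose vector, per frame
v = nose - body
heading = np.arctan2(v[:, 1], v[:, 0])  # radians; 0 = arena 'right' (+x axis)

# Rule-based 'look right': heading within +/- 30 degrees of the +x axis
look_right = np.abs(heading) < np.deg2rad(30)

def is_following(body_a, nose_a, body_b, max_dist=0.2, max_angle=np.deg2rad(45)):
    """Rule-based 'following' heuristic for two animals: A is close to B
    and A's heading points towards B (thresholds are illustrative)."""
    to_b = body_b - body_a
    dist = np.linalg.norm(to_b, axis=1)
    head_a = np.arctan2((nose_a - body_a)[:, 1], (nose_a - body_a)[:, 0])
    angle_to_b = np.arctan2(to_b[:, 1], to_b[:, 0])
    dtheta = np.angle(np.exp(1j * (angle_to_b - head_a)))  # wrapped angle difference
    return (dist < max_dist) & (np.abs(dtheta) < max_angle)

follow = is_following(body, nose, body_b=np.random.rand(T, 2))
```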

For supervised behavioural analysis, annotated examples of behaviour are obtained and then a classifier is trained. This classifier can operate on pose, video frames or many other modalities94,95,96. Owing to the widely available pose estimation tools, various approaches have been developed to predict behaviour from pose tracking data73,97,98,99,100,101,102. More generally, in computer science, the related task of action recognition has seen a lot of progress, due to large-scale benchmarks103 and advances in model architectures, including foundation models104,105,106,107,108 (Box 2).

For unsupervised methods, various computational approaches are widely used to decompose behaviour into ‘syllables’84,85,109,110,111,112. However, these models typically operate on a single timescale, which can be either an implicit or explicit parameter85. In unsupervised representation learning competitions for behavioural analysis, such as MABe22 (ref. 99), adapted variants of BERT113, Perceiver34, TS2Vec114 and PointNet115 initially reached the best results. In addition, AmadeusGPT93 performed well in generating rule-based analysis code from natural language user input via language models. Hierarchical masked autoencoding-based methods (hBehaveMAE)116 and contrastive methods integrating multiple timescales, such as bootstrap across multiple scales117, later reached better performance both for identifying social actions and for classifying genotype and environmental conditions.

Of course, it is (relatively) straightforward to collect large amounts of video of animals in experiments. However, annotating these data is time consuming, costly, error prone, subject to biases and requires expert knowledge10,11,118. To develop better methods, larger data sets that annotate behaviours of interest need to be created. Here, one could also leverage published work in which the behaviour was annotated manually. Another important direction that demonstrates the power of emerging approaches is the creation of synthetic data based on simulators116,119. For example, due to the scarcity of large-scale hierarchical behavioural benchmarks, Stoffl et al.116 created a synthetic basketball-playing benchmark (Shot7M2) and showed that hBehaveMAE learns interpretable behavioural latents on Shot7M2 as well as on non-synthetic data sets.

Why infer all these variables when many — especially high-level behavioural inferences — are perhaps subjective and difficult to validate? Neural data offer one of the most objective metrics for assessing these measurements. The critical question is whether one can identify corresponding neural signatures in the brain. Do these signatures map onto the circuits that generate behaviour in a correspondingly hierarchical manner?

This capability would be transformative for neuroscience, where linking neural activity to naturalistic, hierarchical behaviour remains a central challenge. By providing a comprehensive behavioural read-out across multiple timescales and organizational levels, such systems would enable neuroscientists to correlate brain activity with precise behavioural events, states and decisions — dramatically advancing our understanding of neural coding, sensorimotor integration and the neural bases of behaviour. Future multimodal brain–behaviour models could tackle this.

Towards hybrid objectives and multimodal modelling

We propose a taxonomy of supervised, generative and contrastive models that can operate on neural or behavioural data alone, or jointly across modalities. Although these categories provide a useful scaffold, modern machine learning increasingly combines elements from multiple paradigms, incorporates pretrained features and trains on heterogeneous data sets (for example, CEBRA with DINO embeddings). This shift reflects a broader trend in AI: moving beyond narrowly defined tasks towards models that learn shared latent representations across diverse data streams and tasks. In neuroscience, this raises the question of whether joint brain–behaviour models might evolve along similar lines to recent successes in multimodal AI, such as vision-language models (Box 2).

In parallel with these developments for neural and behavioural analysis, recent advances in AI, particularly vision-language models, have shown the power of learning joint latent representations across modalities without assigning one as primary and the others as auxiliary. Such vision-language models are (so far) primarily used outside neuroscience. Bai et al.120 proposed an early vision-language model that combined BLIP121 (which jointly optimizes three objectives: image–text contrastive learning for aligning image and text embeddings; image–text matching for determining whether a caption matches an image; and language modelling for generating captions or answers from visual input) with the Qwen large language model122 (which processes visual tokens as input to the language model). Such models learn shared latent spaces by aligning visual and language streams through contrastive or generative pretraining123,124,125. These architectures capture rich semantic relationships by simultaneously encoding and decoding across modalities, offering a compelling blueprint for future neuroscience models.

In addition, the use of new AI tools for behavioural measurement has expanded rapidly in recent years. As we aimed to highlight, moving to hierarchical measurements of behaviour, and even mapping pose to biomechanical models, is now possible89,126,127,128. Namely, given a biomechanical model, one can imitate recorded 3D pose estimation data and infer muscle dynamics via physics simulations129. Naturally, inferring those (latent) variables is crucial for modelling the somatosensory, proprioceptive and motor systems, and several recent studies are at the interface of motion capture, biomechanics and neuroscience89,126,127,128. Thus, these higher-dimensional behaviour variables will be critical to reveal biological insights with joint modelling approaches.

Inspired by this, we believe that next-generation hybrid objective models of neural data should move beyond the conventional encoder–decoder pipeline or single-modality supervision. Rather than treating spikes as outputs and behaviour as labels, or vice versa, truly multimodal neural models can learn embeddings that simultaneously predict, align and reconstruct multiple streams: spiking activity, behavioural videos or other task-related stimuli. This likely requires objective functions that integrate self-supervised contrastive, generative and reconstruction-based losses, enabling models to reason jointly about neural dynamics, internal states and externally observable behaviours. Specifically, future approaches may incorporate latent dynamics with high-dimensional output modelling, where the goal is to reconstruct visual stimuli or even the biomechanical level of behaviour given neural recordings, or vice versa. Such tasks will benefit from architectural innovations beyond transformers or state-space models (Fig. 1). Although those generic architectures scale efficiently to long sequences, tailoring such multimodal networks to multiple input–output tasks with high performance is still an active area of machine learning research. New architectures tailored to the spatio-temporal structure of neuroscience data might also need to be considered. These hybrid frameworks may lead to foundation models (Box 2) that infer shared latent spaces of perception and action, enabling generalization across tasks, individuals and experimental settings.

Here, we also briefly link to data-driven and task-driven models of the brain. Work in this field also leverages the power of AI, but to explicitly build models of brain function for hypothesis testing and making discoveries (reviewed previously3,26,130). For example, Wang et al.131 recently developed a data-driven foundation model for the primary visual cortex of mice that is trained to predict spiking activity in multiple brain areas from the video stimulus viewed by the animal together with measured behaviour, such as pupil direction and diameter. They showed that this model generalizes to predict the responses to classic visual stimuli (which was not previously possible) and the responses of other mice. Notably, this model can also predict cell types and anatomical areas131, illustrating the potential for multimodal applications.

Trustworthy, interpretable and performant joint models

As joint models become more central to neuroscientific discovery, we argue that it is no longer sufficient to benchmark solely on performance in spike prediction or behavioural decoding. Instead, we must systematically assess mechanistic interpretability metrics, such as ‘consistency’, ‘identifiability’ and ‘robustness’ of the models — core properties that reflect whether models yield reproducible, interpretable representations across runs, data sets and participants (Box 1). These criteria are essential for building trustworthy and scientifically useful models. Thus, future benchmarking efforts should also focus on trustworthiness and interpretability in joint brain–behaviour models, and we propose a scorecard to help shape these efforts (Table 1).

Table 1 Scorecard for joint brain–behaviour models

Trustworthiness derives from consistency, identifiability and robustness. Consistency across runs measures the stability of embeddings or predictions when models are retrained with different random seeds or data subsets, ensuring reproducibility54,132. Identifiability evaluates whether latent representations can be uniquely recovered up to simple transformations (for example, linear mappings) across sessions or individuals, crucial for meaningful cross-data set comparisons51,54. Robustness to noise and perturbations quantifies sensitivity to input corruption, missing data or adversarial attacks, highlighting model reliability under real-world conditions133 (Table 1). Although this is often not considered in neuroscience research, in real-world neurotechnology applications such as BMIs there is growing recognition of such issues.
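As a sketch, consistency across runs can be quantified by how well one run’s embedding predicts another’s under a linear (here, affine) map, in the spirit of the R²-after-linear-fit metric described above; the helper below is illustrative:

```python
import numpy as np

def linear_consistency(emb_a, emb_b):
    """R^2 of the best affine map from embedding A to embedding B.

    emb_a, emb_b: (T, D) embeddings of the same samples from two
    training runs (for example, different random seeds).
    """
    a = np.column_stack([emb_a, np.ones(len(emb_a))])  # append intercept column
    W, *_ = np.linalg.lstsq(a, emb_b, rcond=None)      # least-squares linear fit
    residual = emb_b - a @ W
    ss_res = (residual ** 2).sum()
    ss_tot = ((emb_b - emb_b.mean(axis=0)) ** 2).sum()
    return 1.0 - ss_res / ss_tot

# Embeddings identical up to rotation score ~1; unrelated ones score ~0
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 3))
rotation, _ = np.linalg.qr(rng.normal(size=(3, 3)))
print(linear_consistency(z, z @ rotation))              # ~1.0
print(linear_consistency(z, rng.normal(size=(500, 3)))) # ~0.0
```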

Interpretability considers whether the learned features are both human-interpretable and mathematically explainable — whether attribution methods such as Shapley values or saliency maps provide consistent and faithful explanations of model decisions that generalize across data sets134,135. Moreover, methods that extend explainable AI to the time domain with theoretical guarantees are emerging55,136. In addition, how well learned latent spaces correspond across different modalities (such as neural activity and behaviour) — cross-modal alignment — can be assessed137,138,139. Evaluating models along these additional dimensions could greatly aid researchers in tool selection and push the field to develop more interpretable models.

A related line of interpretability work comprises methods and metrics developed to compare representations (Table 1). Classical methods for comparing neural population dynamics include canonical correlation analysis, which identifies linear projections that maximize shared variance between data sets140, and representational similarity analysis, which compares pairwise dissimilarity matrices of neural responses141. Centred kernel alignment was later introduced in machine learning to robustly compare representational spaces, even across layers of deep networks, and has since been shown to be mathematically related to representational similarity analysis under certain conditions142,143. Emerging methods include shape metrics, a promising approach proposed by Barbosa et al.144 and Williams et al.145. In brief, this approach builds on, and formalizes, Procrustes distances146 to quantify similarity in neural populations by evaluating explicit geometric transformations between neural trajectories, allowing flexible specification of distance measures that capture population-level neural dynamics. Another metric is dynamic similarity analysis, a non-linear approach that compares the spatio-temporal elements of dynamical systems147.
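For example, linear centred kernel alignment reduces to a few lines, following the standard linear-CKA formula (X and Y are two representations of the same T samples):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between representations X (T, D1) and Y (T, D2)."""
    X = X - X.mean(axis=0)  # centre each feature
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(X.T @ Y, ord='fro') ** 2   # cross-covariance alignment
    norm_x = np.linalg.norm(X.T @ X, ord='fro')      # normalization terms
    norm_y = np.linalg.norm(Y.T @ Y, ord='fro')
    return hsic / (norm_x * norm_y)                  # 1 = identical up to rotation/scale
```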

Open challenges

Modelling across diverse neural and behavioural data types is not without complexity. As implicitly noted throughout this Review, challenges arise from differences in sampling rates and modality-specific noise characteristics, as well as from the methods used both to assess performance and to characterize the resulting representational geometries of the models. A major challenge is the heterogeneity of data types, including spike trains, functional MRI signals and video-based pose estimation, each having different sampling rates, noise profiles, assumptions and generative mechanisms. Developing robust frameworks that can handle asynchronous, incomplete and noisy multimodal data streams remains a critical challenge.

As experimental paradigms become more naturalistic, the number of relevant behavioural measurements, and their variability, may grow substantially. This creates a fundamental tension: more realistic behaviours require more complex models, but limited data necessitate simpler approaches to avoid overfitting. Cross-session modelling can help here. However, although we can now train powerful models across multiple sessions, they rely on strong assumptions. How can this be done correctly when inputs and computations vary across trials, sessions or behavioural contexts?

Model selection also remains an open problem; particularly, when ground truth latent states are unavailable, it becomes challenging to know whether the learned latents are meaningful. To aid in this, we argue that traditional metrics such as reconstruction error or decoding accuracy must be supplemented with measures such as explainability, robustness and representational similarity (Table 1). Model selection could also involve leveraging activity recorded in other brain areas. Indeed, inferring putative unmeasured inputs such as sensory inputs, neuromodulatory signals or those from upstream brain areas is a major open challenge.

Notably, interpretability is a critical challenge. Deep learning models, particularly large-scale transformers and multimodal foundation models, may not produce human-interpretable latents. As these models grow in complexity, their outputs risk becoming disconnected from mechanistic insight unless constrained by priors or structured inductive biases grounded in neuroscience.

Conclusions

In summary, we synthesized recent advances in joint modelling of neural and behavioural data, with a focus on methodological innovations, scientific and engineering motivations, and key areas for future innovation. Specifically, we discussed innovations in discriminative, generative and contrastive joint models and recent advances in behavioural analysis methods, including pose estimation and hierarchical behaviour analysis. In addition, we argued that traditional metrics such as the reconstruction error or decoding accuracy must be supplemented with measures such as explainability, robustness and representational similarity. We believe that their incorporation will yield new hybrid approaches that can leverage the rich diversity of behaviour, but also allow for new principles of neural coding to be uncovered.

Joint brain–behaviour modelling is rapidly reshaping our ability to understand how neural dynamics generate complex behaviour. Looking ahead, the fusion of discriminative, generative and contrastive approaches, large-scale neural recordings and multimodal behavioural measurements from high-level behavioural states to biomechanics promises not just better prediction but conceptual breakthroughs. Moving beyond joint models that capture the latents of neural dynamics as shaped by behaviour, future models may begin to uncover new mathematical principles of neural computation.

The most exciting frontier lies in discovering emergent laws that describe how dynamic neural systems encode, transform and act on information — principles that might be as fundamental to neuroscience as conservation laws are to physics. As computational power grows and data become richer across ecological contexts and diverse species, we anticipate that the next generation of embodied, situated and hierarchical models will not merely simulate brain function but also reveal the organizing principles that make adaptive intelligence possible. By embracing the full complexity of natural behaviour while grounding our models in the physical reality of bodies moving through environments, we believe the field stands at the threshold of a new synthesis — one that will transform both our understanding of biological intelligence and our ability to create artificial systems that exhibit truly adaptive, flexible behaviour. The challenge ahead is not just technical but conceptual: can we develop theoretical frameworks powerful enough to bridge the gap between the richness of natural behaviour and the elegance of fundamental principles? We are optimistic that the answer is yes, and we look forward to contributing to this transformative journey.