Introduction

Digital twin technology represents a paradigm shift in precision livestock farming, transforming how dairy operations monitor, analyze, and optimize animal management through the creation of virtual replicas of physical farm entities1,2. These advanced systems integrate real-time sensor data, behavioural patterns, physiological parameters, and environmental conditions to enable continuous monitoring, predictive analytics, and automated decision-making at scales and levels of precision previously unattainable3,4. In the dairy sector, digital twins combine behavioural monitoring, mechanistic physiological modeling, nutritional requirement assessment, and environmental impact analysis to generate individualized management strategies that improve both animal welfare and production efficiency5. Empirical evidence supports these benefits: digital twin deployment can enhance feed conversion efficiency by 15–20%, reduce veterinary interventions by 25–30%, and cut greenhouse gas emissions per unit of milk by as much as 25%2,6,7. These outcomes illustrate the potential of digital twins as transformative tools for addressing the economic, environmental and welfare challenges facing modern dairy production. Robust and continuous behavioural monitoring thus serves as the foundation upon which these digital twins operate, providing dynamic, high-frequency data to model individual cow states and management outcomes.

A central prerequisite for digital twin functionality is accurate and continuous behavioural monitoring. In dairy cattle, five behavioural categories, namely feeding, drinking, lying, standing, and ruminating, serve as critical indicators of metabolic balance, reproductive status, disease onset, and welfare8. Feeding behaviour is closely tied to dry matter intake and energy balance, and its measurement is essential for nutritional modeling and ration optimization. Rumination patterns provide direct insights into digestive function and metabolic efficiency, with deviations often preceding metabolic disorders such as acidosis, ketosis, or displaced abomasum5. Lying behaviour is a proxy for cow comfort and lameness risk; research has established that 10–14 h of daily lying is optimal for both productivity and welfare outcomes9. Drinking behaviour, while less frequent, is vital for assessing hydration, feed palatability, and heat stress, all of which impact milk yield and welfare10. Even standing, often overlooked as a passive activity, can indicate restlessness, discomfort, or estrus expression when measured systematically. Automated recognition of these behaviours is therefore indispensable for creating the continuous, structured data streams that feed into digital twin models1,3. While the biological relevance of these behavioural metrics is well established, the challenge lies in capturing them continuously and objectively in commercial farm environments.

Conventional behavioural monitoring methods, however, present significant limitations that restrict their scalability and accuracy. Manual observation, though capable of capturing fine-grained detail, is labor-intensive, subjective and economically unsustainable for commercial herds that often number in the hundreds. Even among trained observers, inter-observer variability can reach 15–25%, reducing the reliability of datasets essential for predictive modeling8,11. Wearable sensor systems, while offering continuous monitoring, face a series of practical and welfare-related challenges. Device loss rates of 5–15% create gaps in longitudinal tracking. Batteries require replacement every 3–6 months, imposing labor costs and necessitating animal handling that can induce stress. Calibration drift reduces measurement reliability, while sensor placement may alter natural behaviours, potentially biasing the very data intended for welfare and productivity optimization5,7. These shortcomings underscore the need for alternative approaches capable of generating high-quality, scalable behavioural data.

Computer vision has emerged as the most promising solution to these limitations, offering a scalable, non-invasive method for monitoring livestock at both individual and group levels. Barn-based video systems enable continuous monitoring of multiple animals simultaneously without interfering with their natural behaviours, while providing richer postural and temporal information than wearable sensors. Video analytics can capture body orientation, head movements, and activity transitions that are critical for distinguishing behaviours such as feeding versus ruminating. Importantly, video-based systems scale cost-effectively, monitoring dozens of animals with a single camera, and integrate easily with existing farm surveillance infrastructure2,12. Furthermore, standardized output formats from computer vision pipelines facilitate integration with farm management software, nutritional modeling tools, and digital twin frameworks.

Recent advances in deep learning architectures have dramatically improved the accuracy and robustness of behaviour recognition systems. State-of-the-art models have achieved 85–95% accuracy in multi-class livestock behaviour classification tasks3,11,13. Transformer-based models, originally developed for natural language processing and now adapted for video understanding, show particular promise for long-duration behaviours. For example, transformer-based classifiers have reached over 90% accuracy in beef cattle behaviour recognition, outperforming convolutional networks in capturing spatiotemporal dynamics11. Yet most published systems remain focused on isolated recognition rather than operational digital-twin workflows: identity persistence, structured outputs, and real-time performance under barn variability are rarely addressed. Bridging this gap requires unified pipelines that connect video analytics to nutritional and management models via standardized, digital-twin-ready behaviour streams.

Several challenges continue to constrain the scalability and robustness of current livestock video-based monitoring systems. Benchmark datasets are typically small (fewer than 1000 clips) and collected under controlled conditions, limiting generalization to commercial barns11,14. The largest publicly available dairy cattle dataset, CBVD-5, contains only 687 clips, insufficient for training deep learning models with robust generalization capability. Moreover, many studies report only offline evaluation results, with real-time performance rarely demonstrated. Yet, for digital twins to function as decision-support tools, latency must remain below 200 ms to provide continuous updates for nutrition, health, and management interventions1,12. Current systems often lack structured outputs, providing only categorical classifications rather than temporally annotated behavioural profiles linked to individual animal IDs. Such structured data streams are critical for downstream integration with nutritional models (e.g., NRC equations), predictive health algorithms, and farm management platforms. Finally, while transformer-based architectures such as TimeSformer have shown strong performance in other domains, their systematic evaluation in livestock contexts remains limited3,11. These gaps highlight the need for comprehensive evaluation frameworks that incorporate real-world datasets, real-time performance, and standardized data outputs tailored to digital twin requirements.

Beyond technical performance, practical deployment factors must also be considered. Barn environments present variable lighting, occlusion, and crowding, all of which complicate computer vision performance. Night-time monitoring often produces reduced accuracy due to low-light conditions, while occlusion from feeding barriers or overlapping animals can obscure key anatomical features necessary for classification. Robust systems must therefore incorporate augmentation strategies and multi-angle camera setups to generalize effectively across such variability. Interpretability is another crucial requirement for industry adoption. Attention mechanisms that highlight anatomically meaningful regions, such as the head for feeding or the legs for lying, provide confidence to farmers, veterinarians, and regulators that the models’ predictions are biologically valid8,10. Beyond model performance, scalability and cost-effectiveness remain decisive factors for commercial uptake. Systems that function in real time on widely available GPU hardware without excessive computational demands are more likely to achieve practical deployment in dairy barns.

Collectively, these considerations make clear that video-based behaviour detection is not only a technical challenge but also a linchpin for the broader adoption of digital twins in dairy farming. By enabling continuous, individualized monitoring, computer vision provides the behavioural data streams required to drive mechanistic nutritional modeling, predictive health analytics, and environmental impact optimization. Without robust and scalable behaviour detection systems, the promise of digital twins for livestock remains unattainable.

This study directly addresses these gaps through the development of a comprehensive video-based cattle behaviour detection system explicitly designed for digital twin integration. We present a large-scale dataset collected under authentic barn conditions, incorporating natural environmental variability. An optimized real-time processing pipeline is implemented, combining YOLO detection and ByteTrack tracking for persistent individual cow identification with comparative evaluation of state-of-the-art SlowFast and TimeSformer architectures for behaviour classification. Structured outputs, including temporally annotated behaviour logs with cow IDs and event durations, are generated in formats compatible with nutritional modeling frameworks and Unity-based visualization systems. By systematically benchmarking performance across models and validating interpretability through spatiotemporal attention analysis, this work establishes a robust technical foundation for integrating behaviour detection into dairy digital twins. In this study, we explicitly position the proposed system as the behavioural perception and state-estimation layer of a dairy digital twin rather than a fully realized, closed-loop cyber-physical system. The focus of this work is on reliable, real-time extraction of structured, per-animal behavioural streams, including identity, behavioural state, temporal boundaries, and duration, from continuous video data under commercial barn conditions. These outputs constitute the foundational inputs required by downstream nutritional, physiological, and visualization modules. Higher-level optimization, control, and actuation components of a complete digital twin are considered outside the scope of this paper.
The objectives of this study are therefore to: (1) construct a large-scale video dataset of core cattle behaviours under commercial barn conditions; (2) develop a real-time multi-animal tracking and classification pipeline; (3) conduct a systematic comparison of SlowFast and TimeSformer architectures; and (4) generate structured, biologically interpretable outputs tailored for digital twin applications in precision dairy systems.

Methods

Experimental environment and data collection

The study was conducted at the Ruminant Animal Centre (RAC) of Dalhousie University’s Agricultural Campus (Truro, Nova Scotia, Canada), a research facility dedicated to advanced dairy production, nutrition, and management studies. Each Holstein cow is tethered in an individual stall with a lying space, a front feed manger, and an adjacent in-stall water bucket. Locomotion is constrained to standing and lying within the stall, and the cows access feed and water without leaving it.

Environmental control within the barn is achieved through a combination of natural and mechanical ventilation systems. Adjustable sidewall curtains allow air exchange based on external weather conditions, while chimneys with exhaust fans and circulation fans maintain airflow and temperature uniformity. During warmer months, additional ventilation fans are directed toward the cows to alleviate heat stress. The barn follows a 19-h light and 5-h dark cycle, with lights turning off at 10:00 p.m. and on at 3:00 a.m. to align with the milking schedule and support circadian rhythm balance.

The RAC herd currently consists of approximately 80 Holstein dairy cows, of which 40 are actively lactating. The animals represent a range of parity and lactation stages, enabling balanced behavioural observations across physiological conditions. Milking is performed twice daily, at 4:30 a.m. and 4:00 p.m., consistent with commercial dairy practices in Atlantic Canada. Cows receive a total mixed ration (TMR) composed of grass silage, corn silage, straw, and concentrate. The formulation is adjusted regularly based on protein and energy analyses of the forage components and the production stage of each group; non-lactating and dry cows receive a lower-energy TMR variant to maintain optimal body condition.

A high-definition closed-circuit surveillance system was installed to enable continuous behavioural monitoring of the dairy herd. The system consisted of a total of seven Panasonic IP cameras, each configured to record at 1920 × 1080 resolution with a frame rate of 25–30 fps.

Six of the units were Panasonic WV-S35302-F2L 2MP Outdoor Vandal Dome Cameras, equipped with 2.4 mm fixed lenses, infrared (IR) illumination for low-light monitoring, and integrated microphones for capturing ambient sound. These cameras are IP66 and IK10 rated for environmental durability and impact resistance, and are compliant with FIPS 140-2 Level 3 standards, ensuring secure data handling.

One additional Panasonic WV-X15700-V2L 4K Outdoor Bullet Camera was installed in a high-activity zone. This unit featured a 4.3–8.6 mm motorized zoom lens and an embedded AI engine capable of supporting up to nine analytic applications simultaneously, enabling high-resolution tracking of fine-grained interactions.

Cameras were mounted at heights of 3–4 meters using Panasonic WV-QWL500-W wall brackets and connected via Proterial 61337-8 CAT6A armored plenum-rated cables. Strategic placement at overhead and angled perspectives ensured coverage optimization, overlapping fields of view, and minimization of blind spots, particularly around feed bunks, water troughs, and lying stalls.

The system provided continuous 24/7 monitoring across both day and night cycles, with infrared capabilities enabling uninterrupted observation. To safeguard against data loss, cameras were supported by an UltraTech 1000 VA/600 W uninterruptible power supply (UPS), ensuring reliability during power fluctuations.

Dataset construction

The raw surveillance footage obtained from the multi-camera system was initially subjected to a systematic preprocessing pipeline to ensure its suitability for downstream behavioural analysis. The video streams were first screened manually to identify behaviour-rich segments containing clear examples of feeding, drinking, lying, and standing. Segments with excessive occlusion, poor lighting, or limited behavioural activity were excluded to maintain spatial and temporal consistency across the dataset. This filtering step followed established practices in large-scale livestock video analysis, where minimizing noise is essential for reliable behavioural inference15. This pipeline is depicted in Fig. 1.

Fig. 1: End-to-end framework for video-based cattle behaviour detection and digital-twin integration.

Continuous video streams from barn cameras are processed through a pipeline comprising (1) object detection and multi-cow tracking (YOLOv11 + ByteTrack), (2) behaviour classification using deep spatiotemporal models (SlowFast / TimeSformer), (3) behavioural data translation into structured activity logs for nutritional modeling via NRC equations, and (4) dynamic synchronization with a Unity-based 3D digital-twin environment for visualization and decision support in precision dairy management.
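As a concrete illustration of stage (3), one behaviour event in such a structured activity log could be serialized as in the following minimal sketch; the field names (`cow_id`, `behaviour`, `start_s`, `end_s`) are hypothetical placeholders, not the exact schema emitted by the pipeline:

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical schema for one event in the structured activity log;
# field names are illustrative, not the pipeline's actual output format.
@dataclass
class BehaviourEvent:
    cow_id: int        # persistent tracking identity
    behaviour: str     # one of the seven annotated categories
    start_s: float     # event start, seconds from stream origin
    end_s: float       # event end

    @property
    def duration_s(self) -> float:
        return self.end_s - self.start_s

    def to_json(self) -> str:
        record = asdict(self)
        record["duration_s"] = self.duration_s  # derived field for consumers
        return json.dumps(record)

event = BehaviourEvent(cow_id=12, behaviour="standing and feeding",
                       start_s=3600.0, end_s=3742.5)
log_line = event.to_json()
```

A flat, self-describing record of this kind can be appended line by line to a log file and parsed directly by nutritional-modeling or visualization modules.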

Following quality filtering, long video sequences were segmented into shorter, behaviour-specific clips using a semi-automated workflow. Cows were localized in each frame using a YOLOv11-based object detection model16, which was trained on a custom dataset of 2308 images, each labeled with bounding boxes around individual Holstein cows. The dataset was divided into training (1923 images; 83%), validation (193 images; 8%), and test (192 images; 8%) splits. Preprocessing included auto-orientation, and extensive data augmentation was applied during training to improve robustness. Augmentation operations17 included horizontal flips, random resized crops (0–12% zoom), small rotations between −10° and +10°, brightness adjustments (−15% to +15%), contrast and exposure variations (−10% to +10%), saturation adjustments (−25% to +25%), Gaussian blur (up to 2.5 pixels), and additive noise applied to 0.1% of image pixels. Each training image generated three augmented variants, effectively expanding dataset diversity.

The YOLOv11 model was trained for 90 epochs with a batch size of 16 and a learning rate of 0.01. Training was conducted in Google Colab Pro. The final model achieved high detection performance with mAP@50 = 0.994, precision = 0.982 and recall = 0.992, indicating reliable cow detection across varying barn conditions.

To maintain the identity of individual cows across frames and throughout video segments, the detections were linked via the ByteTrack multi-object tracking algorithm18, which has been shown to outperform traditional identity-preserving trackers in complex agricultural scenes. This ensured that each cow maintained a consistent ID across frames, even during occlusion or group interactions. Once tracking was established, per-cow bounding boxes were cropped from each frame and resized to 224 × 224 pixels, producing standardized video clips corresponding to individual animals.
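The core idea behind frame-to-frame identity persistence can be illustrated with a simplified greedy intersection-over-union (IoU) matcher. ByteTrack itself additionally exploits low-confidence detections and Kalman-filter motion prediction, so this sketch is only a conceptual reduction:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def associate(tracks, detections, iou_thresh=0.3):
    """Greedy IoU matching: highest-overlap pairs are linked first.
    Returns {track_id: detection_index} for pairs above the threshold."""
    pairs = sorted(((iou(tb, d), tid, di)
                    for tid, tb in tracks.items()
                    for di, d in enumerate(detections)),
                   reverse=True)
    matched, used_t, used_d = {}, set(), set()
    for score, tid, di in pairs:
        if score < iou_thresh or tid in used_t or di in used_d:
            continue
        matched[tid] = di
        used_t.add(tid)
        used_d.add(di)
    return matched

# Two cows move slightly between frames; their IDs should carry over.
tracks = {1: (0, 0, 100, 100), 2: (200, 0, 300, 100)}
dets = [(205, 5, 305, 105), (5, 5, 105, 105)]
links = associate(tracks, dets)  # track 1 -> detection 1, track 2 -> detection 0
```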

All extracted video segments were standardized to a fixed duration of 10 s. This length was selected as a compromise between capturing short, transient actions such as drinking, and longer-duration behaviours such as lying or ruminating, which often span several minutes11. A fixed temporal window allowed for uniform sampling across the dataset and simplified downstream training procedures. The approach is consistent with recent work in animal behaviour recognition, which emphasizes the importance of balancing clip length with the temporal resolution required to distinguish between different behavioural states19. As a result, the preprocessing pipeline produced a curated set of behaviour-focused clips that maintained both ethological relevance and computational tractability for annotation and model development. The full workflow, from raw video to processed clips, is illustrated in Fig. 2.
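Drawing a fixed number of frames uniformly from each standardized 10-s clip can be sketched as follows; the 25 fps recording rate and 32-frame model input are illustrative values consistent with the setup described here:

```python
def uniform_frame_indices(n_total, n_sample):
    """Evenly spaced frame indices covering the whole clip; if the clip is
    shorter than the requested sample count, every frame is returned."""
    if n_sample >= n_total:
        return list(range(n_total))
    step = n_total / n_sample
    # Centre each sample within its temporal bin for symmetric coverage.
    return [int(i * step + step / 2) for i in range(n_sample)]

# A 10-s clip at 25 fps has 250 frames; sample 32 for the classifier input.
idx = uniform_frame_indices(250, 32)
```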

Fig. 2: Video preprocessing and dataset construction workflow.

Continuous barn surveillance footage was processed through a multi-stage pipeline involving (1) raw video input from multiple cameras, (2) frame-wise cow detection using YOLOv11, (3) identity-preserving multi-object tracking via ByteTrack, and (4) per-cow bounding box cropping to generate standardized 10-s clips at 224×224 px resolution. This pipeline enabled the creation of a balanced, behaviour-focused dataset suitable for spatiotemporal model training and digital-twin integration.

Annotation framework

Behavioural annotation was performed on the curated 10-s clips to classify cow activity into seven categories: standing and feeding, lying and feeding, drinking, lying, lying and ruminating, standing and ruminating, and standing. These behaviours were selected because they represent the dominant components of the daily activity budget of dairy cattle and are directly linked to productivity, health, and welfare outcomes20. Each behaviour was defined according to established ethological criteria and verified against visual distinguishability standards to ensure consistent labeling.

Feeding was annotated when a cow’s head was directed toward the feed bunk, with visible engagement in feed intake11. Drinking was assigned when the muzzle was in contact with or directly above the water trough, typically accompanied by head movements consistent with water ingestion21. Lying was defined by a resting posture in which the torso was in contact with the stall surface, with limbs folded beneath or alongside the body22. Standing was annotated when the cow maintained an upright posture with all four hooves in contact with the ground but without locomotion21. Ruminating was identified primarily through cyclical jaw movements associated with cud chewing, which typically occurred during lying but was also observed in stationary standing postures. These operational definitions ensured that categories were mutually exclusive and visually distinguishable in the recorded footage.

To ensure annotation reliability, a dual-annotator protocol was implemented. Two trained annotators independently labeled all clips, and inter-rater agreement was calculated using Cohen’s kappa coefficient23, which consistently exceeded 0.95 across the dataset, indicating excellent agreement. In cases of disagreement, annotators engaged in consensus discussions to resolve inconsistencies. This multilayered annotation protocol combined ethological rigor, human oversight, and veterinary expertise, ensuring both biological accuracy and reproducibility of the dataset.
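Cohen’s kappa corrects raw percent agreement for the agreement expected by chance given each annotator’s label frequencies. A minimal implementation, with a hypothetical 20-clip example in which the two annotators disagree on a single clip:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same clips."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n      # observed
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    classes = set(labels_a) | set(labels_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in classes)  # by chance
    return (p_o - p_e) / (1 - p_e)

# Hypothetical example: agreement on 19 of 20 clips.
a = ["lying"] * 10 + ["feeding"] * 10
b = ["lying"] * 9 + ["feeding"] * 11
kappa = cohens_kappa(a, b)  # 0.90 for this example
```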

The final curated dataset comprised 4964 behaviour clips, each standardized to a fixed length of 10 s prior to augmentation. The dataset spanned multiple temporal dimensions of variation. Recordings were collected 24/7, thereby capturing fluctuations in behaviour associated with environmental changes such as temperature, humidity, and ventilation dynamics. In addition, the dataset encompassed both diurnal cycles and nocturnal activity patterns, supported by infrared camera functionality that enabled continuous monitoring during low-light conditions at night. Physiological variation was also represented, as cows at different lactation stages were included, thereby accounting for differences in activity budgets across productive and non-productive phases of the dairy cycle24,25.

Analysis of class distributions revealed a naturally imbalanced dataset, reflecting the time-allocation patterns typical of Holstein dairy cattle. Lying was the most frequent behaviour, comprising 24.2% of the dataset, followed by standing (18.3%), feeding (15.4%), ruminating (12.1%), and drinking (3.8%). Such distributions align with established ethological research demonstrating that dairy cows spend a substantial proportion of their daily cycle resting or lying, while drinking occupies only a small fraction of total activity26.

Although the raw dataset exhibited a natural imbalance across behavioural categories, such skewed distributions present a methodological challenge for machine learning, as minority classes like drinking and ruminating may be underrepresented in model training. To address this issue, the imbalance was explicitly corrected in subsequent preprocessing steps through targeted data augmentation27,28. These augmentation strategies expanded the representation of rare behaviours while maintaining the integrity of majority classes, resulting in a more balanced dataset that better supports model generalization. By correcting imbalance at the data preparation stage, the training corpus was aligned both with the ecological diversity of cow behaviours and the computational requirements for robust classification.

Data augmentation strategy

To address the limitations of the naturally imbalanced dataset and to improve the robustness of behaviour recognition models, a structured data augmentation strategy was employed. Augmentation has been shown to enhance the generalization of deep learning models by synthetically expanding training data diversity while preserving ethological validity17. In this study, augmentation was applied across spatial, photometric, and temporal dimensions, followed by selective class balancing to mitigate underrepresentation of infrequent behaviours.

Spatial augmentations were designed to reduce overfitting to fixed barn layouts and camera viewpoints. Random resized cropping was applied with scaling factors between 0.8 and 1.2, allowing the network to learn from slightly zoomed-in and zoomed-out perspectives. Horizontal flipping was incorporated to mimic mirrored viewpoints, while safe rotations limited to ±15° introduced natural variability in orientation without compromising behavioural interpretability. These operations helped the model generalize across subtle positional and angular differences that arise from camera placement or cow movement.

Photometric augmentations were introduced to address variability in lighting conditions and sensor noise. Brightness adjustments of ±20% and contrast modifications of ±15% simulated the natural fluctuations in illumination across day-night cycles and seasonal changes. Additionally, Gaussian noise with σ = 0.02 was added to replicate the visual distortions that occur under low-light or high-contrast recording conditions. By incorporating these variations, the model was trained to remain invariant to non-behavioural visual artifacts while focusing on essential cues for classification.
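The brightness (±20%), contrast (±15%), and Gaussian-noise (σ = 0.02) perturbations described above can be sketched as a single function on normalized pixel intensities. This is an illustrative re-implementation, not the pipeline's actual augmentation code:

```python
import random

def photometric_augment(pixels, rng,
                        brightness=0.20, contrast=0.15, noise_sigma=0.02):
    """Randomized brightness, contrast, and additive Gaussian noise applied
    to a flat list of pixel intensities in [0, 1]."""
    b = 1.0 + rng.uniform(-brightness, brightness)  # brightness factor
    c = 1.0 + rng.uniform(-contrast, contrast)      # contrast factor
    mean = sum(pixels) / len(pixels)                # frame mean intensity
    out = []
    for p in pixels:
        p = p * b                          # brightness scaling
        p = mean + (p - mean) * c          # contrast stretch about the mean
        p += rng.gauss(0.0, noise_sigma)   # simulated sensor noise
        out.append(min(1.0, max(0.0, p)))  # clamp to the valid range
    return out

frame = [0.1, 0.5, 0.9, 0.3]
aug = photometric_augment(frame, random.Random(0))
```

Seeding the generator, as above, makes augmentation reproducible across training runs.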

Examples of spatial and photometric augmentations applied to cow images are shown in Fig. 3. These include random cropping, rotations, brightness adjustments, and noise addition, which increase dataset diversity while preserving ethological interpretability.

Fig. 3: Examples of spatial and photometric augmentations applied to cow detection frames.

Representative augmentation operations include random cropping, rotation, brightness adjustment, and noise injection. These transformations increase visual diversity and improve model robustness to variations in camera viewpoint, illumination, and barn environment while preserving ethological interpretability of cow behaviours.

Temporal augmentations accounted for the inherent variability in behavioural tempo and duration. Frame sampling rates were varied to simulate different effective frame rates, thereby ensuring that the model learned representations that were robust to temporal resolution changes. In addition, clip duration modifications were performed within a safe margin around the standard 10-s window, allowing the model to handle both shorter clips (e.g., drinking events) and slightly extended sequences (e.g., lying or ruminating). These operations encouraged the network to capture temporal dynamics of behaviour at multiple granularities, a practice consistent with best practices in spatio-temporal modeling29,30.
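Frame-rate variation of this kind can be implemented by sampling frame indices at different temporal strides from a random offset. The following sketch, with assumed parameter values, illustrates the idea:

```python
import random

def strided_clip_indices(n_frames, n_sample, stride, rng):
    """Pick n_sample frame indices at a fixed temporal stride from a random
    start offset, simulating a different effective frame rate; the stride
    shrinks automatically if the clip is too short for the requested span."""
    span = (n_sample - 1) * stride + 1
    if span > n_frames:
        stride = max(1, (n_frames - 1) // max(1, n_sample - 1))
        span = (n_sample - 1) * stride + 1
    start = rng.randrange(0, n_frames - span + 1)
    return [start + i * stride for i in range(n_sample)]

# Sample 32 frames at every 4th frame from a 250-frame (10-s, 25 fps) clip.
idx = strided_clip_indices(250, 32, 4, random.Random(0))
```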

Class balancing implementation

In addition to enhancing data diversity, augmentation was selectively employed to address the class imbalance observed in the raw dataset. Rare behaviours such as drinking and ruminating were deliberately oversampled through targeted augmentation factors, while more frequent behaviours were augmented conservatively. Specifically, drinking clips were augmented 7.5-fold, ruminating clips 4.5-fold, and the remaining behaviours (feeding, standing, lying) by 2–3-fold.
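The effect of these factors can be checked arithmetically. In the sketch below, per-class clip counts are approximations back-calculated from the reported class percentages of the 4964-clip dataset, and the factors for feeding, standing, and lying are assumed values within the reported 2–3-fold range:

```python
# Illustrative check of the targeted oversampling factors. Counts are
# approximations derived from the reported percentages, not exact values;
# the 2.0x and 2.5x factors are assumed within the stated 2-3-fold range.
raw_counts = {"lying": 1201, "standing": 908, "feeding": 764,
              "ruminating": 601, "drinking": 189}
factors = {"lying": 2.0, "standing": 2.5, "feeding": 2.5,
           "ruminating": 4.5, "drinking": 7.5}

augmented = {b: round(raw_counts[b] * factors[b]) for b in raw_counts}
total = sum(augmented.values())

# Imbalance ratio (largest class / smallest class) before and after.
imbalance_before = max(raw_counts.values()) / min(raw_counts.values())
imbalance_after = max(augmented.values()) / min(augmented.values())
```

Under these assumptions the augmented corpus exceeds 9600 clips and the largest-to-smallest class ratio drops markedly, consistent with the rebalancing described below.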

This strategy expanded the dataset to over 9600 clips, resulting in a substantially more balanced distribution across the behaviours. The final dataset composition included feeding (22.9%), standing (21.9%), lying (20.8%), ruminating (18.8%), and drinking (15.6%). By mitigating imbalance while preserving the ecological realism of behaviour patterns, the augmented dataset provided a robust foundation for training behaviour recognition models capable of performing reliably in real-world farm environments. The effect of the augmentation pipeline on class balance is illustrated in Fig. 4a, b, which compares the behavioural class distributions of the unaugmented and augmented datasets. As shown, augmentation substantially increased the representation of minority behaviours such as drinking and ruminating, thereby improving dataset uniformity for model training.

Fig. 4: Behavioural class distributions before and after data augmentation.

A Natural behavioural distribution in the un-augmented dataset, where lying and standing dominate and drinking and ruminating are under-represented. B Rebalanced distribution after targeted spatial, photometric, and temporal augmentation. Targeted oversampling was applied to minority behaviours, such as drinking and ruminating, to reduce class imbalance while maintaining biologically realistic activity proportions; the resulting distribution shows near-uniform representation across the seven behavioural categories, supporting robust spatiotemporal model training.

Deep learning architecture implementation

To evaluate spatiotemporal modeling approaches for cow behaviour recognition, two state-of-the-art video classification architectures were implemented: the SlowFast network30 and the TimeSformer model31. These architectures were selected because they represent complementary strategies for capturing both short-term motion dynamics and long-range temporal dependencies, which are essential for distinguishing between rapid actions such as drinking and extended behaviours such as lying or ruminating.

The SlowFast network employs a dual-pathway design, consisting of a slow pathway that processes frames at a low temporal resolution and a fast pathway that captures finer temporal granularity. In this study, the slow pathway operated on 8 frames per clip sampled at fixed intervals, while the fast pathway processed 32 frames per clip at a reduced spatial resolution of 224 × 224 pixels. This configuration allowed the slow pathway to capture global semantic context such as posture (e.g., lying vs. standing), while the fast pathway focused on short-term dynamics such as head movements during feeding or drinking.

The two pathways were connected via lateral fusion layers, enabling information flow between slow and fast branches. These connections ensured that fine-grained motion features extracted by the fast branch enriched the semantic representations of the slow branch. Both pathways used a 3D convolutional backbone (ResNet-style), with spatiotemporal kernels applied across the frame sequence to jointly model motion and appearance. Computational efficiency was achieved by applying fewer channels in the fast pathway (1/8 of the slow pathway), reducing redundancy while preserving motion sensitivity. This design is particularly suited to livestock behaviour analysis, as it mirrors the temporal scales of dairy cow activity budgets: fast dynamics (head/muzzle motion during feeding or drinking) and slow dynamics (postural changes such as lying or standing)30.
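The two temporal resolutions and the channel ratio can be expressed in a short sketch; the 250-frame clip length and 64-channel base width are illustrative values, not the network's actual configuration:

```python
def slowfast_indices(n_frames, slow_t=8, fast_t=32):
    """Frame indices for the two pathways: the fast pathway samples the clip
    densely, and the slow pathway keeps every alpha-th fast frame, where
    alpha = fast_t // slow_t (4 in this configuration)."""
    fast = [int(i * n_frames / fast_t) for i in range(fast_t)]
    alpha = fast_t // slow_t
    slow = fast[::alpha]
    return slow, fast

slow, fast = slowfast_indices(250)  # e.g., a 10-s clip at 25 fps

# Channel ratio beta = 1/8: with an assumed 64 slow channels, the fast
# pathway uses 8, trading channel capacity for temporal resolution.
slow_channels = 64
fast_channels = slow_channels // 8
```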

The TimeSformer model adopts a transformer-based architecture that replaces 3D convolutions with divided spatiotemporal self-attention mechanisms. Each input video clip was decomposed into non-overlapping patches, which were linearly embedded and fed into transformer encoder layers. Attention was computed separately along spatial and temporal dimensions, reducing computational cost compared to full joint attention while retaining modeling power31.

In the spatial attention module, multi-head self-attention with 8 heads was applied to capture dependencies among different regions within each frame. In parallel, the temporal attention module modeled relationships across frames, allowing the network to capture long-range dependencies such as the transition from standing to lying or sustained ruminating bouts. To ensure position awareness, learned positional encodings were incorporated, enabling the model to distinguish between patches based on both spatial location and temporal order.

A hybrid attention scheme was employed, in which layers alternated between spatial and temporal attention. This design provided the model with a global receptive field across both space and time, while maintaining computational tractability. By leveraging self-attention, the TimeSformer was able to integrate contextual information across the entire clip, making it especially effective for recognizing behaviours characterized by subtle, distributed cues.
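The divided space-time attention described above can be sketched in a few lines of numpy. This is a simplified, single-head illustration without learned query/key/value projections (the real TimeSformer uses multi-head attention with learned weights); all function names are ours. It shows the key idea: attending first across frames at each patch position, then across patches within each frame, which is cheaper than full joint attention over all T × P tokens.

```python
import numpy as np

def _softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def _attend(x):
    """Single-head self-attention without learned projections."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    return _softmax(scores) @ x

def divided_space_time_attention(tokens):
    """Divided attention over a (T, P, D) token grid:
    T frames, P patches per frame, D embedding dimension.
    Temporal attention runs per patch position, then spatial
    attention runs per frame.
    """
    T, P, D = tokens.shape
    # temporal: each spatial position attends across all frames
    out = np.stack([_attend(tokens[:, p]) for p in range(P)], axis=1)
    # spatial: each frame's patches attend to one another
    out = np.stack([_attend(out[t]) for t in range(T)], axis=0)
    return out

x = np.random.rand(12, 16, 32)   # 12 frames, 16 patches, 32-dim tokens
y = divided_space_time_attention(x)
print(y.shape)  # (12, 16, 32)
```

The cost advantage is why the divided scheme scales: full joint attention is quadratic in T·P tokens, whereas the divided form is quadratic in T and P separately.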

Training configuration and optimization

All experiments were conducted on a cloud-based GPU platform (Google Colab; Google LLC, USA), which provided access to an NVIDIA L4 GPU (24 GB VRAM) for model training. This cloud-based setup ensured sufficient memory and computational capacity for handling spatiotemporal video architectures such as 3D CNNs and transformers. Local preprocessing, annotation handling, and lightweight experiments were performed on a MacBook Air with Apple M3 chip (8-core CPU, integrated GPU, 16 GB unified memory).

The computational environment was configured on Ubuntu 20.04 LTS with CUDA 11.6 and PyTorch 1.12.0 serving as the primary deep learning framework. The environment was also equipped with standard libraries for computer vision and video processing, including OpenCV 4.7, NumPy, and scikit-learn. This setup provided a reproducible and flexible platform for both development and large-scale training.

Model training was tailored to the architectural and computational requirements of each network. For the SlowFast network30, training was performed with a batch size of 4, using the Adam optimizer with a learning rate of 1 × 10⁻⁴. The loss function was cross-entropy, and the model was trained for 10 epochs, with extensions to 20 when necessary. Data loading employed two worker threads (num_workers = 2), with shuffling enabled to maximize diversity.

For the TimeSformer model31, a more advanced optimization scheme was required to stabilize transformer training. Each input consisted of 12 frames of 224×224 resolution, processed in batches of 4 samples, with gradient accumulation across 4 steps to achieve an effective batch size of 16. Training proceeded for 20 epochs using the AdamW optimizer, with a learning rate of 3 × 10⁻⁵, weight decay of 0.01, and a warmup ratio of 0.1 to gradually ramp learning in the early iterations. To ensure numerical stability, gradients were clipped at a norm of 1.0, and mixed precision training was enabled to improve efficiency. Early stopping was applied with a patience of 4 epochs, halting training when validation performance failed to improve.
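The warmup and gradient-accumulation scheme described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the study's training code: `lr_at_step` is our name, and the post-warmup schedule is shown as constant for simplicity (decay variants differ between setups).

```python
def lr_at_step(step, total_steps, base_lr=3e-5, warmup_ratio=0.1):
    """Linear warmup to base_lr over the first warmup_ratio of
    training steps, then constant thereafter."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

# Effective batch size 16 from batch size 4 x 4 accumulation steps.
# Inside a training loop (sketched as comments):
#   loss = criterion(model(batch)) / ACCUM_STEPS   # scale per micro-batch
#   loss.backward()                                # accumulate gradients
#   every ACCUM_STEPS micro-batches:
#       clip gradient norm to 1.0, optimizer.step(), zero gradients
ACCUM_STEPS = 4

print(lr_at_step(0, 1000))    # small LR at the first warmup step
print(lr_at_step(999, 1000))  # base_lr once warmup is complete
```

Dividing the loss by the number of accumulation steps keeps the accumulated gradient equivalent to a single large batch, which is what makes the effective batch size of 16 meaningful.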

Real-time processing pipeline

The proposed framework was implemented as a modular end-to-end video analysis pipeline, enabling automated behaviour recognition from continuous surveillance footage of dairy cows. The system integrated three major components: object detection and tracking, behaviour classification, and temporal smoothing.

Raw video input from the multi-camera surveillance system was first processed through the YOLOv11 object detection model, trained on manually annotated cow images as described in Section 2.2.1. This stage localized individual cows within each frame. The resulting detections were then passed to the ByteTrack multi-object tracking algorithm, which maintained consistent cow identities across successive frames and prevented errors due to occlusions or group interactions. This ensured that each animal’s behaviour was modeled continuously over time rather than as isolated instances.

The cropped, per-cow video clips were subsequently fed into the behaviour classification models (SlowFast or TimeSformer), which predicted one of seven behavioural states: standing and feeding, drinking, lying and feeding, lying, standing, standing and ruminating, or lying and ruminating. To accommodate model input requirements, a sliding window strategy was employed, in which clips of 16–32 consecutive frames were extracted with 50% overlap (e.g., an 8-frame stride for 16-frame windows). This design captured sufficient temporal context for behaviour recognition while ensuring efficient use of computational resources.
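The sliding-window extraction can be sketched as a small index generator; the function name and defaults are ours, chosen to match the 16-frame window with 8-frame stride (50% overlap) described above.

```python
def sliding_windows(n_frames, window=16, stride=8):
    """Return (start, end) frame indices of overlapping windows
    over one cow's frame track. stride = window // 2 gives the
    50% overlap used in the pipeline."""
    return [(s, s + window) for s in range(0, n_frames - window + 1, stride)]

print(sliding_windows(48))
# [(0, 16), (8, 24), (16, 32), (24, 40), (32, 48)]
```

Each (start, end) pair indexes a clip that is cropped to the tracked cow and passed to the classifier, so every frame (except at track boundaries) is covered by two overlapping windows.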

Finally, to improve robustness against frame-level misclassifications, the system incorporated majority voting and temporal consistency enforcement. To reduce short-term state flicker in the per-cow behavioural timeline, we perform sliding-window inference with 50% overlap between consecutive windows. For each tracked cow, predicted behaviour labels are temporally smoothed using a majority vote over the most recent K = 3 overlapping windows. A behaviour transition is accepted when the smoothed label changes, which effectively requires at least two out of three consecutive windows to support the new state. In addition, very short behaviour segments are filtered using a minimum-duration threshold, and segments are terminated if the cow is not observed for longer than a predefined inactivity timeout. These parameters were selected empirically based on pilot evaluations to balance responsiveness to genuine behavioural transitions with robustness against short-term misclassifications. By combining real-time detection, identity-preserving tracking, spatiotemporal classification, and temporal smoothing, the pipeline provided a reliable end-to-end solution for continuous monitoring of cow behaviour in commercial farm environments.
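The majority-vote smoothing and minimum-duration filtering described above can be sketched as follows. This is a minimal illustration, not the deployed code; function names and the `min_len` handling (back-filling short runs with the preceding label) are our assumptions.

```python
from collections import Counter

def smooth(preds, k=3):
    """Majority vote over the most recent k overlapping windows,
    emitted causally for each new window prediction."""
    out = []
    for i in range(len(preds)):
        recent = preds[max(0, i - k + 1): i + 1]
        out.append(Counter(recent).most_common(1)[0][0])
    return out

def drop_short_segments(labels, min_len=2):
    """Replace runs shorter than min_len with the preceding label."""
    out = list(labels)
    i = 0
    while i < len(out):
        j = i
        while j < len(out) and out[j] == out[i]:
            j += 1
        if j - i < min_len and i > 0:
            out[i:j] = [out[i - 1]] * (j - i)
        i = j
    return out

raw = ["lying", "lying", "standing", "lying", "lying", "lying"]
print(smooth(raw))  # the single-window "standing" flicker is voted out
```

With K = 3, a new state must win the vote in the recent window history, which matches the requirement that at least two of three consecutive windows support a transition before it is accepted.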

The final stage of the pipeline was responsible for converting model predictions into structured outputs suitable for downstream analysis, integration with nutritional modeling frameworks, and visualization in digital twin environments. To ensure both interoperability and reproducibility, standardized formats were adopted for data storage and streaming.

All predictions were logged into structured CSV files containing the following fields: cow ID, predicted activity label, start timestamp, end timestamp, activity duration, and model confidence score. These logs enabled quantitative evaluation of behavioural activity budgets while preserving the temporal context of each prediction. Each entry was aligned with the synchronized video timeline, ensuring that results could be directly traced back to the original raw footage.
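The structured log described above can be sketched with the standard-library `csv` module. The exact column names are our assumptions based on the fields listed in the text; the example row uses cow_id = 2 from the case study purely for illustration.

```python
import csv
import io

# Assumed column names mirroring the fields described in the text
FIELDS = ["cow_id", "activity", "start_ts", "end_ts", "duration_s", "confidence"]

def write_activity_log(rows, fh):
    """Write behaviour predictions as a structured CSV log."""
    writer = csv.DictWriter(fh, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)

buf = io.StringIO()
write_activity_log([{
    "cow_id": 2,
    "activity": "lying and ruminating",
    "start_ts": "2025-03-15T06:12:00",
    "end_ts": "2025-03-15T06:58:30",
    "duration_s": 2790,
    "confidence": 0.93,
}], buf)
print(buf.getvalue().splitlines()[0])  # header row
```

Keeping timestamps in ISO 8601 and durations in seconds makes the log directly consumable by downstream activity-budget calculations and the Unity visualization layer.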

To support domain-specific applications, the output schema was designed for compatibility with NRC-based nutritional modeling frameworks, where activity budgets (e.g., feeding or lying duration) inform energy expenditure and intake calculations. In addition, the logs were formatted for seamless integration with Unity-based visualization systems, where predicted states were used to drive cow avatars in a digital twin of the barn environment. This facilitated intuitive, real-time interpretation of behavioural patterns by both researchers and farm managers.

For real-time applications, the system also incorporated low-latency streaming capabilities. Predictions were generated and transmitted in near real-time, with latency optimization achieved through GPU-accelerated inference, sliding window buffering, and asynchronous log writing. This ensured that activity updates were available within seconds of observation, enabling the framework to function not only as a retrospective analysis tool but also as a foundation for real-time monitoring and decision support systems in precision dairy farming.

The Unity-based 3D digital twin referenced in this paper represents an existing visualization environment used to consume the structured behavioural outputs generated by the proposed pipeline (cow ID, behavioural state, start-end time, duration, and confidence). A detailed description of the 3D environment, physiological modeling, and any closed-loop control mechanisms is beyond the scope of this study, which focuses specifically on the perception and behavioural state layer.

Ethics approval and consent

All management and feeding practices comply with Canadian Council on Animal Care (CCAC) guidelines and standard Canadian dairy production protocols. The RAC maintains stringent biosecurity and welfare standards, ensuring that all animals receive routine veterinary oversight, comfortable housing, and nutritional management aligned with NRC (2001) recommendations. All animal procedures were approved by the Dalhousie University Animal Care and Use Committee (Protocol #2024-026, approval date 16‑05‑2024) and complied with CCAC guidelines. The data acquisition was entirely non-invasive, involving passive image and video capture without any physical contact or intervention with the animals. The participating farm owners provided informed written consent after being fully briefed on this study’s objectives and protocols. This adherence to ethical standards ensured that animal welfare was not compromised throughout the data collection and research.

Results

Dataset characterization and distribution analysis

The annotated dataset comprised seven distinct behavioural classes extracted from continuous barn recordings. As summarized in Fig. 4, the augmented dataset achieved a near-uniform class balance following the targeted oversampling strategy described in Section 2.3.2. Post-augmentation, the behavioural proportions aligned closely with known ethological activity budgets of Holstein cows: lying and standing remained the most frequent states, while drinking and ruminating increased to biologically realistic levels. This balanced representation provided a stable foundation for downstream model training, ensuring that minority behaviours were adequately represented without distorting the natural behavioural hierarchy.

The overall behavioural distribution closely aligns with known diurnal activity patterns of Holstein cows under tie-stall housing conditions. Lying and standing behaviours were predominant during non-feeding hours, reflecting periods of rest and comfort32. Feeding activity peaked during scheduled feed deliveries (typically morning and late afternoon), while ruminating bouts were interspersed throughout the day, particularly following feeding episodes24. Drinking events occurred less frequently but were distributed evenly across the photoperiod33. These observations confirm that the collected dataset reflects biologically realistic behavioural proportions rather than sampling artifacts.

The augmentation pipeline effectively expanded the training set size while preserving spatiotemporal consistency and behavioural realism. Augmented samples retained anatomical fidelity (e.g., head alignment with feed bunks or water troughs) and contextual cues (stall boundaries, lighting gradients) critical for accurate model training. Quantitatively, class balance improved from an initial max:min ratio of ~9:1 to approximately 2:1, reducing overfitting risk and improving minority-class recognition during preliminary validation runs.

Collectively, these preprocessing and augmentation procedures produced a balanced, ecologically valid dataset capable of supporting robust training of deep learning models for fine-grained cow behaviour recognition.

Model performance comparative analysis

The comparative evaluation of TimeSformer and SlowFast models aimed to determine the optimal spatiotemporal architecture for recognizing complex cattle behaviours under realistic barn conditions. Both models were trained and validated on the same curated seven-class dataset encompassing Drinking, Feeding and Lying, Feeding and Standing, Lying, Ruminating and Lying, Ruminating and Standing, and Standing. Each model was trained with the optimization settings described in the Methods (Adam at a learning rate of 1 × 10⁻⁴ for SlowFast; AdamW at 3 × 10⁻⁵ for TimeSformer) and identical augmentation pipelines (horizontal flip, temporal jittering, illumination normalization). The dataset was split into 70% training, 15% validation, and 15% testing sets.

Across all evaluation runs, the TimeSformer achieved a mean overall accuracy of 85.0%, exceeding the SlowFast baseline accuracy of 82.3%. This improvement reflects the transformer’s inherent strength in capturing long-range temporal dependencies via its divided space-time self-attention mechanism. By encoding each video clip as a series of patch-level tokens, TimeSformer effectively attends to extended motion sequences such as continuous feeding or rumination cycles. The SlowFast model, although powerful in detecting high-frequency actions like drinking or head-lifting, relies primarily on convolutional hierarchies that emphasize localized motion and hence can miss broader behavioural context when frame-to-frame differences are minimal.

To ensure the robustness of the observed performance gap, five independent trials were executed using different random initializations. The mean accuracy difference of 2.7 percentage points was statistically significant (p = 0.031, α = 0.05) based on a paired-sample t-test. This reproducibility underscores the stability of the transformer encoder for real-world deployment, where models must maintain reliable behaviour detection despite day-to-day environmental variability. The superior accuracy of TimeSformer therefore validates the hypothesis that attention-driven global context modeling is essential for continuous barn surveillance, in which many cow behaviours manifest over extended time horizons rather than through abrupt motion cues alone.

A fine-grained comparison of precision, recall, and F1-scores (Table 1) provides further insight into class-specific performance trends. Table 1 summarizes the observed metrics and the corresponding improvements (ΔF1) obtained by the transformer-based model.

Table 1 Per-class F1-scores for SlowFast and TimeSformer models across seven cattle behaviour categories

The largest improvement was recorded for Standing, followed by Ruminating & Standing. These behaviours exhibit limited limb motion but prolonged duration, where the attention mechanism excels at integrating subtle temporal features, such as slight head tilts or jaw motions spread over many frames. Conversely, modest F1-score reductions for static postures (Lying, Ruminating & Lying) suggest that when motion cues are almost absent, TimeSformer’s patch-level embeddings may average out fine-grained micro-movements.

Macro-averaged F1 improved from 0.827 (SlowFast) to 0.841 (TimeSformer), while the weighted F1 increased from 0.851 to 0.861. These gains confirm that transformer models generalize better across majority and minority behaviour categories, yielding a more balanced classification under heterogeneous barn environments. Moreover, TimeSformer’s superior performance under nighttime and occluded conditions demonstrates its ability to learn illumination-invariant spatiotemporal representations, a crucial requirement for continuous, unattended surveillance in digital-twin pipelines.

Confusion matrix and error analysis

Whereas overall and per-class metrics quantify accuracy, confusion-matrix analysis exposes systematic misclassification patterns and the behavioural semantics behind them. To examine class-wise prediction behaviour under different training conditions, confusion matrices were generated for both unaugmented and augmented datasets for each architecture. Figure 5(a–d) present a side-by-side comparison: (a) SlowFast - Unaugmented, (b) SlowFast - Augmented, (c) TimeSformer - Unaugmented, and (d) TimeSformer - Augmented. All matrices use identical color scaling for direct comparability.

Fig. 5: Comparison of confusion matrices for behaviour classification using SlowFast and TimeSformer architectures under different training conditions.

A SlowFast model trained on the unaugmented dataset. B SlowFast model trained on the augmented dataset. C TimeSformer model trained on the unaugmented dataset. D TimeSformer model trained on the augmented dataset. Each panel shows the normalized confusion matrix for classification across the seven behavioural categories. Rows represent ground-truth labels and columns represent predicted labels. The matrices illustrate class-wise prediction accuracy and misclassification patterns, enabling comparison of architectural performance and the influence of data augmentation on behavioural separability.

All four matrices exhibit strong diagonal dominance, confirming that most behaviours were correctly identified. However, persistent confusion clusters reveal key perceptual ambiguities inherent in cattle behaviour recognition:

  • Lying vs. Ruminating & Lying (~12–15% cross-confusion): These categories differ mainly through subtle jaw-movement cues. SlowFast’s motion-centric filters sometimes failed to detect small cyclic mandibular motions, labeling ruminating cows as merely lying. TimeSformer’s attention mechanism partially alleviated this overlap by tracking head-region temporal changes, thereby reducing off-diagonal errors.

  • Standing vs. Ruminating & Standing: Both states share identical postural geometry, making visual separation challenging. TimeSformer achieved a higher Standing F1 (0.75 vs. 0.66) due to its capacity to integrate weak but consistent temporal signals, such as rhythmic chewing motions persisting across several seconds.

Overall, the TimeSformer confusion matrix exhibits higher diagonal purity and reduced class-spillover, underscoring its superior ability to encode fine-grained temporal semantics. In practical terms, these improvements translate directly into more precise behavioural state transitions within the digital twin, ensuring that the simulated cow entities maintain realistic time-aligned activity cycles.

A post-hoc qualitative review of misclassified clips was performed to trace error origins and assess their implications for digital-twin fidelity.

Three dominant sources were identified:

Behavioural Ambiguity: Many errors stem from the continuum between similar activities. Static postures punctuated by micro-movements often blur class boundaries. For example, a cow transitioning from Lying to Ruminating & Lying may perform slow, imperceptible head adjustments that span multiple frames, challenging both convolutional and attention-based encoders. These findings emphasize the biological fluidity of behaviour categories, suggesting that discrete classification may need to evolve toward probabilistic or continuous state modeling in future digital-twin iterations.

Environmental and Lighting Variability: The barn environment introduced dynamic illumination, shadowing, and occlusion events. During dusk or under mixed natural-artificial lighting, contrast reduction led to occasional false positives in Drinking when reflective surfaces mimicked mouth-to-trough contact. Although TimeSformer exhibited stronger illumination resilience, both models showed mild accuracy drops in nocturnal segments, implying that domain-adaptive normalization or temporal brightness augmentation could further enhance generalization.

Model Representation Bias: Minority classes (Feeding & Lying, Ruminating & Standing) contained fewer labeled instances, yielding under-optimized feature representations. The transformer’s tokenization may dilute critical motion vectors, while SlowFast’s motion-intensity bias sometimes over-fitted to fast activities.

Mitigation strategies include focal loss weighting, synthetic clip generation, and attention-map regularization to improve the salience of under-represented actions.

Misclassifications manifest as short-term state errors in the virtual twin, occasionally causing temporal discontinuities or overcounting of specific behaviours (e.g., fragmented rumination episodes). However, incorporating temporal smoothing through majority voting over K = 3 overlapping windows per cow, together with minimum-duration filtering, attenuates short-term label flicker and preserves realistic behavioural continuity in the generated timelines. The qualitative review thus reinforces the importance of explainable model diagnostics, through attention-map visualization and per-frame attribution, to support transparent, trustworthy integration of computer-vision outputs into precision-nutrition digital twins.

Architecture-specific strengths and weaknesses

A comparative assessment of both models reveals complementary strengths arising from their distinct spatiotemporal encoding strategies.

The TimeSformer, leveraging transformer-based self-attention, demonstrates superior sensitivity to feeding and drinking behaviours, where subtle head and mouth motions dominate. Its patch-token attention distributes weights dynamically across spatial regions, enabling the model to integrate micro-movements within longer temporal windows. Consequently, TimeSformer effectively identifies sequences where feeding occurs intermittently amid minimal body displacement, an ability crucial for capturing nutritional intake events in the digital twin.

Conversely, the SlowFast architecture excels at posture-driven behaviours such as Lying, Standing, and Ruminating & Lying. Its dual-pathway design, in which the fast branch encodes fine temporal granularity and the slow branch captures broader spatial context, allows strong geometric posture recognition even under low-motion conditions. The network’s convolutional inductive biases help stabilize predictions in cluttered environments, reducing false positives for large-scale static postures.

Error-pattern inspection further reveals biologically consistent tendencies: misclassifications typically occur between physiologically adjacent states, such as transitions between Ruminating & Standing and Standing, or Lying and Ruminating & Lying. These correspond to authentic behavioural continuums rather than algorithmic artifacts, confirming that both models mirror the natural fluidity of bovine activities rather than imposing arbitrary separations. Hence, the observed confusions hold biological interpretability, underscoring the potential of deep spatiotemporal models as quantitative tools for ethological research.

Spatiotemporal attention visualization

To interpret model decision processes, attention-map visualizations34,35 were generated for representative clips of each behavioural class. These visualizations reveal that learned focus areas align with anatomically relevant regions, reinforcing biological validity.

During Feeding and Drinking, approximately 65% of cumulative attention weight concentrated around the head and muzzle region, particularly near the feed trough and water bucket interfaces. This indicates that the transformer successfully prioritized regions corresponding to ingestion activity, capturing the periodic forward-backward jaw motion characteristic of feeding bouts. For Standing and Lying postures, attention activation shifted toward body and leg contours, representing the model’s recognition of skeletal orientation and support distribution. In Ruminating & Lying states, attention maps alternated between head and thoracic regions, reflecting internalized understanding of chewing cycles within stationary frames.
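The head-region attention concentration reported above (roughly 65% of cumulative weight) can be quantified by integrating attention mass over an annotated region. The sketch below is illustrative only; the function name and region convention are ours, and the real analysis would use attention maps exported from the trained model.

```python
import numpy as np

def attention_fraction(heatmap, box):
    """Fraction of total attention mass inside a bounding box.

    heatmap: non-negative (H, W) attention map for one frame.
    box: (y0, y1, x0, x1) region, e.g. an annotated head/muzzle box.
    """
    y0, y1, x0, x1 = box
    return float(heatmap[y0:y1, x0:x1].sum() / heatmap.sum())

# Sanity check: under a uniform map, the fraction equals the
# box's share of the total image area.
h = np.ones((100, 100))
print(attention_fraction(h, (0, 50, 0, 50)))  # 0.25
```

Averaging this fraction over frames and clips of a given class yields the per-behaviour concentration statistics reported in the text.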

The corresponding attention visualizations are presented in Fig. 6, which illustrate the spatial concentration of the model’s focus across representative behavioural classes. As shown, the TimeSformer consistently directs attention toward anatomically relevant regions, such as the head and muzzle during feeding and drinking or the torso and limbs during postural states, demonstrating biologically coherent feature learning. These spatial distributions confirm that model activations correspond to meaningful anatomical cues rather than spurious background patterns. By validating attention heatmaps against known ethological indicators, such as head movement frequency and body-weight distribution, the study establishes a direct link between learned visual salience and biological interpretability.

Fig. 6: Spatiotemporal attention maps highlighting regions contributing to behaviour classification.

Visualization of spatiotemporal attention maps generated by the TimeSformer model for representative behavioural classes. Attention heatmaps indicate the spatial regions and temporal segments that contribute most strongly to classification decisions. Across representative behaviours, attention consistently concentrates on anatomically relevant regions, including the head and muzzle during feeding and drinking and the torso and limbs during postural states.

Robustness analysis

To assess robustness under realistic operating conditions, classification performance was stratified along two key dimensions commonly encountered in barn environments: illumination variability and partial visual occlusion. These analyses were conducted as post-hoc evaluations on a trained model, with identical preprocessing and inference settings applied across all subsets.

Classification performance was first analysed under varying lighting conditions using a representative 24-h barn recording. Video clips were grouped into daytime (RGB illumination) and nighttime (infrared illumination) intervals based on recording timestamps. As shown in Table 2, daytime performance remained high (accuracy = 96.1%, macro-F1 = 0.94), while nighttime performance exhibited a consistent reduction (accuracy = 89.8%, macro-F1 = 0.73). Error analysis revealed that most nighttime misclassifications occurred between visually similar posture–oral activity pairs (e.g., ruminating versus lying), reflecting reduced contrast and loss of fine-grained mouth motion cues under infrared illumination.

In addition to lighting conditions, robustness to partial occlusion was evaluated by stratifying inference performance based on cow visibility. Clips corresponding to cows with unobstructed camera views were categorized as non-occluded, while clips associated with cows frequently partially obscured by barn structures or other animals were categorized as occluded. Under non-occluded conditions, the model achieved an accuracy of 96.49% with a macro-F1 score of 0.96. Performance under occlusion decreased modestly to 94.74% accuracy and 0.93 macro-F1. Class-wise analysis indicates that behaviours characterized by distinct motion patterns, such as Drinking, remained highly robust under occlusion (F1-score = 0.99). In contrast, fine-grained posture-dependent behaviours exhibited greater sensitivity to partial obstruction. Notably, Feeding & Lying and Ruminating & Standing showed reduced F1-scores under occluded conditions (0.89), compared to ≥ 0.95 under non-occluded conditions.

Table 2 Comparison of classification performance under varying illumination (day/night) and visibility (occluded/non-occluded) conditions using accuracy and macro-F1 score

Overall, these results demonstrate that the proposed TimeSformer-based framework is robust to moderate visual degradation, maintaining high accuracy under both reduced illumination and partial occlusion. While nighttime infrared imagery introduces a larger performance drop than occlusion, both analyses highlight the importance of fine-grained visual cues for distinguishing closely related behavioural states in real-world barn environments.

Notably, the performance drop under partial occlusion was modest (94.7% accuracy, macro-F1 = 0.93, versus 96.5% and 0.96 under non-occluded conditions). This resilience likely reflects, in part, selection effects in the stratification process: occluded clips more frequently corresponded to stable, prolonged behavioural states such as lying or standing, which exhibit limited temporal ambiguity, whereas non-occluded clips included a higher proportion of state transitions and boundary frames, which are inherently more difficult to classify.

In addition, partial occlusion in the tie-stall environment is often spatially consistent and caused by fixed barn infrastructure such as feed barriers or stall dividers, which do not obstruct key anatomical regions such as the head or torso. As a result, the model can maintain high recognition accuracy despite reduced visibility. These findings suggest that the proposed framework is tolerant to moderate, structured occlusion, although more severe cow-to-cow overlap, as commonly observed in freestall systems, would be expected to introduce greater performance degradation.

Real-time processing performance

Both architectures were benchmarked on an NVIDIA L4 GPU on Google Colab using mixed-precision inference. The TimeSformer achieved an average throughput of 22.6 fps, surpassing SlowFast’s 18.4 fps due to its efficient token-wise computation and reduced temporal redundancy. Despite its transformer complexity, TimeSformer maintained lower GPU memory consumption (7.2 GB) compared to SlowFast’s 8.9 GB, attributed to the absence of multi-branch convolutional streams.

From a deployment perspective, both models meet real-time processing thresholds for 25 fps video feeds12. However, the TimeSformer’s higher parallelization efficiency and consistent batch-to-batch latency (~44 ms/frame) make it preferable for continuous inference within digital-twin pipelines where computational scalability and stability are essential. A comparative hardware analysis indicates that real-time deployment is feasible on high-end consumer GPUs or edge-AI modules (e.g., Jetson AGX Orin) with slight down-sampling. Such resource profiling ensures that behaviour recognition remains sustainable for on-farm automation without excessive power draw.

To emulate barn-scale conditions, the end-to-end system was evaluated on simultaneous video streams tracking 16 individual cows using YOLOv11 + ByteTrack for detection and identity assignment. The combined detection-tracking-classification pipeline achieved an average latency of 180 ms per frame, corresponding to real-time throughput at 5.5 fps per cow. Batch-processing optimization and asynchronous GPU queues reduced total inference time by 27%, confirming the framework’s scalability to group monitoring without significant degradation in accuracy. These benchmarks demonstrate the system’s capacity for large-herd digital-twin synchronization, where multiple physical cows can be updated concurrently in the virtual environment. Future work will explore multi-GPU distribution and temporal batching to achieve near-real-time herd-level behaviour analytics.

Temporal sampling ablation and design trade-offs

To further examine the effect of temporal sampling density on recognition performance, we conducted a controlled ablation study in which the number of frames sampled per 10-s clip was varied while all other training settings were held constant. Sampling at 12 frames per clip (1.2 fps) achieved a validation accuracy of 0.75, establishing a strong baseline under sparse temporal sampling. Increasing the sampling density to 16 frames per clip (1.6 fps) improved validation performance, yielding the highest accuracy (0.79) and indicating that moderate increases in temporal resolution can enhance recognition of behavioural cues.

However, further increasing temporal resolution to 32 frames per clip (3.2 fps) led to a decline in validation performance (accuracy 0.70) while substantially increasing computational cost and training time. Taken together, these results demonstrate that higher temporal density does not monotonically improve generalization under fixed training budgets. Considering these findings, the primary experiments retain the 12-frame configuration as a conservative and computationally efficient setting, consistent with the goal of continuous 24/7 deployment, while the ablation results highlight the potential benefits and limitations of denser temporal sampling.

To ensure tractable experimentation, the ablation analysis was conducted on a representative subset of N = 1500 clips (balanced across all seven behavioral classes) while preserving identical training protocols, augmentation strategies, and optimization settings across sampling configurations. This approach isolates the effect of temporal sampling density without introducing variations in data composition or optimization settings.
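The sampling densities compared in the ablation correspond to evenly spaced frame indices within each 10-s clip. The sketch below illustrates this uniform sampling scheme under the assumption of 25 fps source video (250 frames per clip); the exact sampling strategy used in training may differ.

```python
# Sketch of uniform temporal sampling for the ablation: pick N evenly
# spaced frame indices from a 10-s clip recorded at 25 fps (250 frames).
# The clip length and centering scheme here are assumptions.

def sample_indices(n_frames: int, clip_len: int = 250) -> list[int]:
    """Evenly spaced, center-aligned frame indices covering the clip."""
    step = clip_len / n_frames
    return [int(i * step + step / 2) for i in range(n_frames)]

for n in (12, 16, 32):                 # the three ablation settings
    effective_fps = n / 10.0           # clip duration is 10 s
    print(n, effective_fps, sample_indices(n)[:3])
```
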

Digital twin case study: 24-h behaviour-to-intake inference

To demonstrate the practical utility of the proposed Digital Twin framework beyond isolated computer vision performance metrics, we present a 24-h case study for a representative dairy cow. Cow 363 in the farm database (corresponding to cow_id = 2 in the vision-based activity tracking system) was selected for analysis on March 15, 2025.

Figure 7 illustrates the complete 24-h behavioural timeline derived from continuous video-based perception. Raw activity detections were aggregated into a structured temporal data stream, capturing transitions between feeding, standing, lying, ruminating, and drinking states throughout the day. Each segment in the timeline represents an automatically detected behavioural event with an associated duration.

Fig. 7: Digital Twin case study illustrating a 24-h behavioural timeline derived from continuous video perception and its use in behaviour-informed intake estimation for an individual dairy cow.

A full-day behavioural timeline derived from continuous video-based perception is shown for an individual dairy cow. Automatically detected behavioural events are aggregated into structured activity segments with associated durations, which are subsequently used to compute a behaviour-informed dry matter intake estimate within the digital-twin framework, linking low-level perception to higher-level nutritional inference.

From this structured representation, daily feeding duration was computed by aggregating all feeding-related behavioural states. Across the 24-h period, the cow exhibited a total feeding duration of 253.1 min (4.22 h). This behavioural summary constitutes a core Digital Twin state variable, bridging low-level perception with higher-level physiological inference.
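The aggregation step above can be sketched as a simple reduction over the detected event stream. The event tuples and label names below are illustrative placeholders, not the exact labels used by the released system.

```python
# Sketch of aggregating detected behavioural events into a daily feeding
# duration, as done for the case-study cow. Event tuples are illustrative:
# (behaviour_label, duration_minutes); the label set is an assumption.

FEEDING_STATES = {"feeding", "feeding & standing"}

def daily_feeding_minutes(events) -> float:
    """Sum durations of all feeding-related behavioural states."""
    return sum(dur for label, dur in events if label in FEEDING_STATES)

timeline = [
    ("feeding", 42.5),
    ("lying", 310.0),
    ("feeding & standing", 18.0),
    ("ruminating & lying", 95.0),
]
print(daily_feeding_minutes(timeline))  # 60.5 for this toy timeline
```
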

Using the aggregated feeding duration in conjunction with known physiological parameters for Cow 363 (body weight = 942 kg, days in milk = 294, parity = 4), a behaviour-informed dry matter intake (DMI) estimate was computed using a saturating intake model20,24,36. The resulting Digital Twin estimated a daily dry matter intake of 22.22 kg/day, demonstrating how video-derived behavioural signals can directly parameterize downstream nutritional inference.
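The cited saturating intake model (refs 20, 24, 36) is not specified in full here, so the sketch below uses a generic saturating-exponential form with placeholder parameters purely to illustrate how feeding duration parameterizes a bounded DMI estimate; both `dmi_max` and `k` are hypothetical and would need calibration against the actual model.

```python
import math

# Hypothetical saturating intake model: DMI rises with daily feeding time
# but saturates toward a physiological ceiling. dmi_max and k are
# illustrative placeholders, NOT the parameters used in the study.

def estimate_dmi(feeding_hours: float, dmi_max: float = 24.0, k: float = 0.6) -> float:
    """Dry matter intake (kg/day) saturating toward dmi_max."""
    return dmi_max * (1.0 - math.exp(-k * feeding_hours))

# Case-study feeding duration of 4.22 h yields a bounded estimate below dmi_max.
print(round(estimate_dmi(4.22), 2))
```
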

Unlike traditional pipelines that treat perception and decision-making as disjoint processes, this case study highlights how continuous sensing, temporal aggregation, and domain-specific modelling can be integrated within a unified Digital Twin architecture. The resulting behavioural and nutritional state estimates provide a foundation for individualized monitoring, intake assessment, and future closed-loop nutritional decision support.

For a Digital Twin system, continuity of identity across time is critical to ensure that behavioural timelines remain coherent and do not introduce distortions into downstream nutritional inference. To assess tracking reliability, we evaluated the continuity properties of the generated behavioural timelines over a full 24-h period. For the case-study cow, the constructed timeline achieved 77.8% temporal coverage across the day, with a median uninterrupted segment duration of approximately 60 min and a fragmentation rate of 0.29 interruptions per hour. Notably, 68% of interruptions were resolved within 30 s, indicating rapid identity re-association following brief tracking losses.
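The continuity metrics reported above (temporal coverage, median segment duration, interruptions per hour) can be computed directly from the timeline's identity-continuous segments. The sketch below shows one plausible formulation; the segment tuples and exact metric definitions are assumptions.

```python
from statistics import median

# Sketch of timeline-continuity metrics for a 24-h behavioural timeline.
# Segments are (start_min, end_min) intervals of uninterrupted identity;
# values below are toy data, not the case-study cow's timeline.

def continuity_metrics(segments, day_minutes: int = 1440) -> dict:
    covered = sum(e - s for s, e in segments)
    gaps = len(segments) - 1  # interruptions between consecutive segments
    return {
        "coverage": covered / day_minutes,
        "median_segment_min": median(e - s for s, e in segments),
        "interruptions_per_hour": gaps / (day_minutes / 60),
    }

segs = [(0, 60), (62, 180), (200, 400)]
m = continuity_metrics(segs)
print(m)
```
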

These continuity characteristics indicate that identity persistence was sufficient to support stable behavioural aggregation and downstream intake estimation, and that the resulting Digital Twin timelines were not dominated by frequent or prolonged identity fragmentation.

Discussion

In precision livestock farming, the term “digital twin” is frequently interpreted as a fully bidirectional cyber-physical system that integrates sensing, prediction, and automated control. In practice, such systems are modular, with perception, state estimation, physiological modeling and decision layers developed and validated independently. This study addresses a critical but often under-specified component: the behavioural perception and state layer. By transforming continuous barn video streams into identity-preserving, temporally structured behavioural events, the proposed pipeline provides the state variables required by higher-level nutritional, health, and visualization modules.

The structured behavioural outputs produced in this work are designed to be consumed by external digital-twin components rather than to constitute a complete twin on their own. While a Unity-based visualization environment is used to display and synchronize these outputs with virtual cow representations, the development of full physiological models, optimization routines, and automated decision or actuation mechanisms lies beyond the scope of the present study. The results therefore establish a validated perception foundation upon which more comprehensive digital-twin implementations can be constructed.

The TimeSformer architecture demonstrated distinct advantages for livestock video analytics owing to its global spatiotemporal attention mechanism, which captures both spatial structures and long-term temporal evolution. By dynamically allocating attention to relevant regions such as the head, muzzle, or feed trough, the model maintains interpretability while adapting to motion sparsity typical of commercial barns. This allows robust recognition of prolonged behaviours like ruminating and feeding & standing, whose visual cues evolve slowly over time. In our 24/7 barn recordings, TimeSformer achieved 85.0% overall accuracy (macro-F1 = 0.84) and processed 22.6 fps on an NVIDIA L4, confirming real-time feasibility. The SlowFast architecture, while slightly lower in accuracy (82.3%), remained competitive for posture-oriented states such as Standing or Lying. Its dual-pathway design (slow spatial, fast motion) efficiently captures short-duration transitions like drinking or standing changes, with predictable latency and low memory overhead, making it attractive for cost-constrained edge deployments. Compared with recent benchmarks, CBVD-537 (687 segments, 107 Holsteins; ~78.7% accuracy) and the beef-cattle dataset of Cao et al.11 (4974 clips; ~90% mAP₅₀), our dataset of 4964 annotated clips (expanded to 9600 via augmentation) encompasses seven behaviour classes under natural lighting and occlusion, achieving parity with state-of-the-art accuracy despite far noisier conditions. The scale, behavioural granularity, and continuous capture across diurnal cycles make it one of the largest open dairy-behaviour video resources, bridging the gap between controlled research corpora and operational barns.

Beyond quantitative accuracy, attention-based interpretability plays an important role in fostering trust and adoption of automated monitoring systems in livestock production. By visualizing the spatial regions and temporal segments that drive classification decisions, farm operators and veterinarians can corroborate algorithmic outputs with on-ground observations, improving transparency and confidence in system outputs. In the context of a digital twin, such interpretability supports the translation of model activations into meaningful behavioural biomarkers, enabling downstream nutritional adjustments, welfare diagnostics, and anomaly detection.

The structured behavioural outputs (cow ID, start/end times, durations, and confidence scores) form the behavioural layer of the dairy digital twin. These CSV streams can directly feed NRC feed-intake models to infer dry-matter intake and dynamically adjust rations. Integration within Unity 3D enables real-time visualization of individual cows’ states synchronized with nutritional and environmental modules, establishing an operational link between perception and physiology. Within the twin dashboard, deviations from baseline feeding or rumination cycles act as early-warning indicators of metabolic or welfare issues, supporting proactive interventions. The architecture will be interoperable with commercial herd-management systems such as DairyComp 305 and Lely T4C and will be designed for modular deployment using networked edge devices. Both TimeSformer and SlowFast achieve real-time throughput on mid-range GPUs, ensuring cost-effective scalability. Economic analyses should consider reductions in manual observation labor, improved feed efficiency, and health-related cost savings. Reliability under barn humidity and dust will be maintained through ruggedized enclosures and modular camera design.
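A minimal sketch of consuming such a behavioural CSV stream is shown below. The column names are an assumed schema for illustration, not the released data format; downstream intake or welfare models would read the same fields.

```python
import csv
import io

# Assumed CSV schema for the behavioural layer (column names illustrative):
# cow_id, behaviour, start, end, duration_min, confidence.
stream = io.StringIO(
    "cow_id,behaviour,start,end,duration_min,confidence\n"
    "2,feeding,2025-03-15T06:10:00,2025-03-15T06:52:30,42.5,0.91\n"
    "2,lying,2025-03-15T07:00:00,2025-03-15T09:10:00,130.0,0.95\n"
)

rows = list(csv.DictReader(stream))
# Aggregate feeding minutes per cow, as a downstream intake model would.
feeding_min = sum(
    float(r["duration_min"]) for r in rows if r["behaviour"] == "feeding"
)
print(feeding_min)
```
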

The results presented in this study were obtained from a tie-stall housing system, which provides controlled viewpoints, limited animal displacement, and reduced inter-animal occlusion. Such environments are well suited for developing and validating behavioural perception pipelines, as they allow stable identity tracking and clear association between individual cows and observed behaviours. Tie-stall deployment therefore represents a controlled initial tier for assessing the reliability, interpretability and real-time performance of video-based behaviour recognition systems.

In contrast, free-stall barns, which are more prevalent in many leading dairy regions, can introduce additional challenges, including frequent occlusion, unrestricted animal movement and transitions across multiple camera views. While these factors increase system complexity, they do not fundamentally alter the role of the proposed framework as a behavioural perception and state-estimation layer. Extension to free-stall environments would primarily require architectural augmentations such as multi-view camera fusion, appearance-based re-identification for cross-camera identity persistence, and more robust occlusion handling. In this context, the current tie-stall implementation serves as a foundational deployment stage, with free-stall barns representing a subsequent scalability tier. Addressing these challenges constitutes a natural next step toward broader commercial applicability rather than a limitation of the proposed approach.

Looking ahead, future work will explore extensions of the proposed framework through multimodal sensing, model optimization, and scalable deployment strategies. Prior studies have demonstrated that combining visual data with acoustic or thermal sensing can improve robustness of rumination and feeding detection under occlusion and low-light conditions10,38,39. In addition, recent work on lightweight transformer architectures, pruning, and quantization has shown that inference cost can be substantially reduced while maintaining competitive accuracy, supporting deployment on resource-constrained edge devices40,41,42. Federated learning approaches have also been investigated in agricultural and video-analytics settings to enable cross-farm model adaptation without sharing raw video data, addressing privacy and data-governance concerns43,44. These directions represent promising avenues for future research and will be investigated to further enhance robustness, scalability, and practical adoption of digital twin–ready behavioural perception systems.

Despite the strong performance observed across spatiotemporal model architectures and temporal sampling configurations, several challenges and methodological limitations define the scope of the present study. These limitations reflect practical constraints associated with continuous video-based livestock monitoring in operational barn environments, as well as deliberate design trade-offs made to support stable evaluation, real-time feasibility, and downstream Digital Twin integration.

Despite strong baseline accuracy, environmental variability remained a significant factor influencing detection reliability. Quantitative analysis revealed an 8–12% reduction in recognition accuracy under night-time conditions, attributable to reduced contrast and increased sensor noise in infrared video10. These effects, explicitly quantified in the day–night performance analysis, represent fundamental sensing constraints rather than model-specific deficiencies. Occlusions caused by barn infrastructure, feeding equipment, and inter-animal overlap further contributed to intermittent trajectory loss and occasional truncation of behavioural sequences. Although the tracking system successfully reidentified approximately 92% of interrupted tracks, brief identity switches introduced minor temporal labelling noise3. Detection consistency was also reduced near the periphery of the camera field of view, where cows partially exited the frame, limiting spatial continuity for both TimeSformer and SlowFast architectures. These observations highlight the importance of optimized camera placement, adaptive illumination normalization, and multi-view camera fusion for robust long-term deployment.

Behaviour-specific recognition limitations were most pronounced in the persistent confusion between Lying and Ruminating & Lying states. These behaviours exhibit near-identical postural geometry, with differentiation relying primarily on subtle mandibular motion25. Such motion cues are occasionally obscured by occlusion or temporally aliased at lower frame rates, constraining discriminative capacity even for advanced spatiotemporal architectures. Improving temporal resolution or incorporating motion-sensitive descriptors, such as optical flow, may enhance distinction in future iterations. Drinking behaviour detection posed additional challenges: false positives were observed when cows lowered their heads near troughs without active ingestion, underscoring the limitations of vision-only sensing for intake-related behaviours and motivating future integration of complementary sensing modalities, such as acoustic monitoring or RFID-triggered water intake measurements10.

Limitations were also observed in temporal-sequence modelling, particularly during rapid transitions between behavioural states (e.g., standing to lying). Both deep-video architectures occasionally produced fragmented predictions around state boundaries, reflecting the absence of explicit temporal continuity constraints. While current models demonstrate strong performance for isolated state recognition, achieving biologically faithful, continuous behavioural tracking for Digital Twin applications may require hybrid temporal approaches that integrate deep attention mechanisms with recurrent or probabilistic state-transition modelling. Tracking reliability similarly warrants further quantification. While the YOLOv11–ByteTrack pipeline preserves per-cow identities across continuous recordings, standard multi-object tracking metrics have not yet been evaluated on an annotated subset. The reported continuity analyses therefore reflect identity persistence under operational conditions rather than benchmarked tracking performance.

From a dataset design perspective, the present study employed a random clip-level train–validation–test split, a strategy commonly adopted in video-based animal behaviour recognition to enable stable and reproducible comparisons across architectures and temporal sampling strategies. Under this setting, video clips originating from the same cow may appear across different splits, potentially allowing models to exploit consistent visual characteristics (e.g., body size, coat texture, markings) alongside behaviour-related motion cues. Consequently, reported performance metrics should be interpreted as reflecting behaviour recognition capability under partially subject-overlapping conditions. Future studies will strengthen generalization assessment by adopting subject-independent evaluation protocols, such as leave-one-cow-out or temporally disjoint splits, to explicitly quantify cross-animal transferability.

Finally, the behavioural taxonomy adopted in this study reflects a pragmatic, monitoring-oriented formulation that combines posture and oral activity into composite labels (e.g., “standing & ruminating”). While this representation supports real-time visualization, behavioural summarization, and downstream Digital Twin integration, it necessarily abstracts certain fine-grained biological distinctions. Future investigations may explore more granular or multi-label behavioural representations to improve interpretability and biological resolution, particularly for nutrition- and health-oriented Digital Twin applications.

Reliable behavioural perception presents a fundamental prerequisite for meaningful Digital Twin applications in dairy production. The framework presented here establishes a video-based behavioural sensing and state-estimation layer that transforms continuous barn video into structured, identity-preserving behavioural streams suitable for downstream modelling and decision support. By integrating spatiotemporal deep learning with temporal aggregation and identity continuity, the approach advances beyond isolated behaviour classification toward coherent, per-animal state reconstruction under commercial barn conditions.

Comprehensive evaluation across model architectures, temporal sampling strategies, and environmental conditions demonstrates robust recognition of core dairy behaviours in the presence of lighting variability and partial occlusion. A 24-h Digital Twin case study further illustrates how continuous video perception can be translated into a structured behavioural timeline and subsequently used to derive behaviour-informed intake estimates, thereby closing the perception-to-decision loop that underpins operational Digital Twin systems.

Although the scope of the work is intentionally focused on behavioural sensing and state reconstruction, it provides a validated foundation for integrating nutritional, health, and management models in future Digital Twin deployments. Ongoing and future efforts will extend this framework through subject-independent evaluation, multimodal sensing, and tighter coupling with mechanistic and data-driven nutritional models. Together, these advances will enable more comprehensive, scalable, and biologically grounded Digital Twins, supporting precision livestock farming that is both operationally practical and scientifically robust.