Abstract
Advances in computer vision and increasingly widespread video-based behavioral monitoring are transforming how we study animal behavior. However, there is still a gap between these prospects and their practical application, especially for videos recorded in the wild. In this Perspective, we present the capabilities of current methods for behavioral analysis while highlighting unsolved computer vision problems that are relevant to the study of animal behavior. We survey state-of-the-art methods for the computer vision problems most relevant to the video-based study of individualized animal behavior, including object detection, multi-animal tracking, individual identification and (inter)action understanding. We then review methods for effort-efficient learning, which remains a key challenge from a practical perspective. In our outlook on the emerging field of computer vision for animal behavior, we argue that the field should develop approaches that unify detection, tracking, identification and (inter)action understanding in a single, video-based framework.
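To make the components named above concrete, the sketch below outlines how such a unified, video-based pipeline could be structured in Python: detection, multi-animal tracking, individual identification and (inter)action understanding applied frame by frame. This is a minimal illustrative sketch, not the implementation of any tool discussed in this Perspective; the names (Detection, Track, associate, run_pipeline) and the stub models are our own assumptions, and the greedy IoU association only mirrors the basic idea of tracking-by-detection methods such as SORT.

```python
# Minimal, hypothetical sketch of a unified video pipeline: detection -> tracking
# -> individual identification -> (inter)action understanding. The names used here
# are illustrative placeholders, not the API of any tool cited in this Perspective.
from dataclasses import dataclass, field


@dataclass
class Detection:
    box: tuple    # (x1, y1, x2, y2) in pixels
    score: float  # detector confidence


@dataclass
class Track:
    track_id: int
    boxes: list = field(default_factory=list)    # one bounding box per frame
    identity: str = "unknown"                    # filled in by the ID module
    actions: list = field(default_factory=list)  # one action label per frame


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def associate(tracks, detections, iou_thr=0.3):
    """Greedy IoU matching of detections to tracks: the core idea behind
    tracking-by-detection (here without motion models or re-identification)."""
    unmatched = list(detections)
    for trk in tracks:
        if not unmatched:
            break
        best = max(unmatched, key=lambda d: iou(trk.boxes[-1], d.box))
        if iou(trk.boxes[-1], best.box) >= iou_thr:
            trk.boxes.append(best.box)
            unmatched.remove(best)
    next_id = max((t.track_id for t in tracks), default=-1) + 1
    for det in unmatched:  # unmatched detections start new tracks
        tracks.append(Track(track_id=next_id, boxes=[det.box]))
        next_id += 1
    return tracks


def run_pipeline(frames, detector, identifier, action_model):
    """Run all modules frame by frame and return per-individual tracks."""
    tracks = []
    for frame in frames:
        detections = detector(frame)            # object detection
        tracks = associate(tracks, detections)  # multi-animal tracking
        for trk in tracks:
            trk.identity = identifier(frame, trk)         # individual ID
            trk.actions.append(action_model(frame, trk))  # (inter)action label
    return tracks


if __name__ == "__main__":
    # Tiny synthetic run with stub models, just to show the data flow.
    frames = range(3)  # stand-ins for video frames
    detector = lambda f: [Detection(box=(10 + 2 * f, 10, 60 + 2 * f, 90), score=0.9)]
    identifier = lambda f, trk: "individual_A"
    action_model = lambda f, trk: "walking"
    for trk in run_pipeline(frames, detector, identifier, action_model):
        print(trk.track_id, trk.identity, trk.actions)
```

In practice each stub would be replaced by a learned model, and the modules would ideally share video features rather than operate independently per frame, which is the kind of unification argued for above.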
Acknowledgements
This work was funded by the Deutsche Forschungsgemeinschaft (German Research Foundation; Project-ID 454648639 – SFB 1528 – Cognition of Interaction, to R.V., T.L., S.D., M.N., D.M., J.F., J.O., O.S., P.M.K., C.F., A.G., S.T., H.S., F.W. and A.S.E.) and with NextGenerationEU funds from the European Union by the Federal Ministry of Education and Research under the funding code 16DKWN038 (to J.H.). Additional funding was provided by the Deutsche Forschungsgemeinschaft (grant no. 254142454/GRK2070 to D.M. and J.F., and grant no. 502807174/GRK2906 to V.H. and A.S.E.). We also acknowledge funding by the Leibniz Association through an Audacity Grant from the Leibniz ScienceCampus Primate Cognition (W45/2019 – Strategische Vernetzung, to J.O., O.S. and A.S.E.).
Author information
Contributions
R.V., T.L. and J.H. led and conducted the literature review, drafted figures, wrote the main parts of ‘Methods for primate behavior analysis’ as well as ‘Methods for effort-efficient learning’ and contributed to ‘Avenues for future research’. S.D., M.N. and V.H. contributed to individual sections of the paper (S.D.: self-supervised, semi-supervised and weakly supervised learning; M.N.: individual identification; V.H.: multi-animal tracking). D.M. provided detailed feedback on drafts. J.F., J.O., O.S., P.M.K., C.F., A.G., S.T. and H.S. contributed to the conceptualization of the Perspective, provided supervision and funding, and advised on key decisions. A.S.E. led the project, provided funding, took primary responsibility for key project decisions and wrote the main parts of ‘Avenues for future research’. All authors contributed to revising and proofreading the paper.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Methods thanks Shaokai Ye and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Nina Vogt, in collaboration with the Nature Methods team.
Additional information
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Additional examples of multimodal large language model usage. Extension of Fig. 5, providing more examples of using multimodal large language models (here, GPT-4V) to understand actions and interactions in images.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Vogg, R., Lüddecke, T., Henrich, J. et al. Computer vision for primate behavior analysis in the wild. Nat Methods 22, 1154–1166 (2025). https://doi.org/10.1038/s41592-025-02653-y