Abstract
Advances in computer vision and increasingly widespread video-based behavioral monitoring are transforming how we study animal behavior. However, there is still a gap between these prospects and their practical application, especially for videos recorded in the wild. In this Perspective, we present the capabilities of current methods for behavioral analysis while highlighting unsolved computer vision problems that are relevant to the study of animal behavior. We survey state-of-the-art methods for the computer vision problems most relevant to the video-based study of individualized animal behavior, including object detection, multi-animal tracking, individual identification and (inter)action understanding. We then review methods for effort-efficient learning, which remains a key challenge from a practical perspective. In our outlook on the emerging field of computer vision for animal behavior, we argue that the field should develop approaches that unify detection, tracking, identification and (inter)action understanding in a single, video-based framework.
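To make the components named above concrete, the sketch below outlines how such a unified, video-based pipeline could be structured in Python: detection, multi-animal tracking, individual identification and (inter)action understanding applied frame by frame. This is a minimal illustrative sketch, not the implementation of any tool discussed in this Perspective; the names (Detection, Track, associate, run_pipeline) and the stub models are our own assumptions, and the greedy IoU association only mirrors the basic idea of tracking-by-detection methods such as SORT.

```python
# Minimal, hypothetical sketch of a unified video pipeline: detection -> tracking
# -> individual identification -> (inter)action understanding. The names used here
# are illustrative placeholders, not the API of any tool cited in this Perspective.
from dataclasses import dataclass, field


@dataclass
class Detection:
    box: tuple    # (x1, y1, x2, y2) in pixels
    score: float  # detector confidence


@dataclass
class Track:
    track_id: int
    boxes: list = field(default_factory=list)    # one bounding box per frame
    identity: str = "unknown"                    # filled in by the ID module
    actions: list = field(default_factory=list)  # one action label per frame


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def associate(tracks, detections, iou_thr=0.3):
    """Greedy IoU matching of detections to tracks: the core idea behind
    tracking-by-detection (here without motion models or re-identification)."""
    unmatched = list(detections)
    for trk in tracks:
        if not unmatched:
            break
        best = max(unmatched, key=lambda d: iou(trk.boxes[-1], d.box))
        if iou(trk.boxes[-1], best.box) >= iou_thr:
            trk.boxes.append(best.box)
            unmatched.remove(best)
    next_id = max((t.track_id for t in tracks), default=-1) + 1
    for det in unmatched:  # unmatched detections start new tracks
        tracks.append(Track(track_id=next_id, boxes=[det.box]))
        next_id += 1
    return tracks


def run_pipeline(frames, detector, identifier, action_model):
    """Run all modules frame by frame and return per-individual tracks."""
    tracks = []
    for frame in frames:
        detections = detector(frame)            # object detection
        tracks = associate(tracks, detections)  # multi-animal tracking
        for trk in tracks:
            trk.identity = identifier(frame, trk)         # individual ID
            trk.actions.append(action_model(frame, trk))  # (inter)action label
    return tracks


if __name__ == "__main__":
    # Tiny synthetic run with stub models, just to show the data flow.
    frames = range(3)  # stand-ins for video frames
    detector = lambda f: [Detection(box=(10 + 2 * f, 10, 60 + 2 * f, 90), score=0.9)]
    identifier = lambda f, trk: "individual_A"
    action_model = lambda f, trk: "walking"
    for trk in run_pipeline(frames, detector, identifier, action_model):
        print(trk.track_id, trk.identity, trk.actions)
```

In practice each stub would be replaced by a learned model, and the modules would ideally share video features rather than operate independently per frame, which is the kind of unification argued for above.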
Acknowledgements
This work was funded by the Deutsche Forschungsgemeinschaft (German Research Foundation; Project-ID 454648639 – SFB 1528 – Cognition of Interaction, to R.V., T.L., S.D., M.N., D.M., J.F., J.O., O.S., P.M.K., C.F., A.G., S.T., H.S., F.W. and A.S.E.) and with NextGenerationEU funds from the European Union by the Federal Ministry of Education and Research under the funding code 16DKWN038 (to J.H.). Additional funding was provided by the Deutsche Forschungsgemeinschaft (grant no. 254142454/GRK2070 to D.M. and J.F., and grant no. 502807174/GRK2906 to V.H. and A.S.E.). We also acknowledge funding by the Leibniz Association through an Audacity Grant from the Leibniz ScienceCampus Primate Cognition (W45/2019 – Strategische Vernetzung, to J.O., O.S. and A.S.E.).
Author information
Contributions
R.V., T.L. and J.H. led and conducted the literature review, drafted figures, wrote the main parts of ‘Methods for primate behavior analysis’ as well as ‘Methods for effort-efficient learning’ and contributed to ‘Avenues for future research’. S.D., M.N. and V.H. contributed to individual sections of the paper (S.D.: self-supervised, semi-supervised and weakly supervised learning; M.N.: individual identification; V.H.: multi-animal tracking). D.M. provided detailed feedback on drafts. J.F., J.O., O.S., P.M.K., C.F., A.G., S.T. and H.S. contributed to the conceptualization of the Perspective, provided supervision and funding, and advised on key decisions. A.S.E. led the project, provided funding, took primary responsibility for key project decisions and wrote the main parts of ‘Avenues for future research’. All authors contributed to revising and proofreading the paper.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Methods thanks Shaokai Ye and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Nina Vogt, in collaboration with the Nature Methods team.
Additional information
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Additional examples of multimodal large language model usage. Extension of Fig. 5, providing more examples of using multimodal large language models (here, GPT-4V) to understand actions and interactions in images.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Vogg, R., Lüddecke, T., Henrich, J. et al. Computer vision for primate behavior analysis in the wild. Nat Methods 22, 1154–1166 (2025). https://doi.org/10.1038/s41592-025-02653-y