Abstract
Antigen receptor numbering allows the antigen-binding regions of antibodies and T cell receptors to be delineated from sequence alone. Numbering is currently achieved by aligning to a reference set. This approach may produce different numbering depending on the reference set used, or fail on sequences from rare species or formats. We present a method (ANARCII) which requires no alignment step and is based on a Seq2Seq language model. ANARCII improves upon existing methods through more consistent numbering of key regions, robustness to truncations, generalisation to unseen species, and easier user installation. The lightweight architecture allows numbering of 90,000 sequences per minute on a high-end GPU. The software is available as a web app (https://opig.stats.ox.ac.uk/webapps/sabdab-sabpred/sabpred/anarcii/) and as a package (https://github.com/oxpig/ANARCII). Ultimately, ANARCII allows numbering of more antibody-like sequences, with better recovery of full-length regions from existing databases, and enables comparative analysis of new receptors not numbered by existing tools.
Acknowledgements
The authors would like to thank Oliver Turnbull, Carlos Outeiral and David Prihoda for their helpful suggestions and feedback. We would also like to thank Chris Thorpe, Benjamin McMaster, Bruce MacLachlan and Nele Quast for their helpful discussions on numbering of MHC/HLA (currently under development and available as a development branch on GitHub). The work was supported through research funding by Exscientia awarded to A.G.W., and doctoral programme funding from the UK Engineering and Physical Sciences Research Council (EPSRC) awarded to S.A.R., G.L.G., H.L.C. and F.C.S. (EP/S024093/1). For the purpose of Open Access, the author has applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission.
Ethics declarations
Competing interests
C.D. discloses membership of the Scientific Advisory Board of Fusion Antibodies and AI proteins, and is a founder of Dalton. All other authors declare no conflict of interest.
AI disclosure
Generative AI tools, including ChatGPT and GitHub Copilot, were utilised to assist in code generation and error checking during the development of this project.
Additional information
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Greenshields-Watson, A., Agarwal, P., Robinson, S.A. et al. ANARCII enables alignment-free antigen receptor numbering using a generalised language model. Commun Biol (2026). https://doi.org/10.1038/s42003-026-10186-z
DOI: https://doi.org/10.1038/s42003-026-10186-z