Conserved heavy/light contacts and germline preferences revealed by a large-scale analysis of natively paired human antibody sequences and structural data

Dudzic, Pawel; Chomicz, Dawid; Bielska, Weronika; Jaszczyszyn, Igor; Zieliński, Michał; Janusz, Bartosz; Wróbel, Sonia; Le Pannérer, Marguerite-Marie; Philips, Andrew; Ponraj, Prabakaran; Kumar, Sandeep; Krawczyk, Konrad

doi:10.1038/s42003-025-08388-y

Download PDF

Article
Open access
Published: 26 July 2025

Conserved heavy/light contacts and germline preferences revealed by a large-scale analysis of natively paired human antibody sequences and structural data

Communications Biology volume 8, Article number: 1110 (2025) Cite this article

8789 Accesses
7 Citations
4 Altmetric
Metrics details

Subjects

This article has been updated

Abstract

Understanding the pairing preferences and structural interactions between antibody heavy and light chains can enhance our ability to design more effective and specific therapeutic antibodies. Insights from natural antibody repertoires and conserved contact sites help reduce autoreactivity and improve drug safety and efficacy. Current databases represent only a limited portion of the estimated diversity of unique paired antibody molecules. To address this, we introduce PairedAbNGS, a novel database with paired heavy/light antibody chains. To our knowledge, this is the largest resource for paired natural antibody sequences with 58 bioprojects and over 14 million assembled productive sequences. Using this dataset, we investigated heavy and light chain variable (V) gene pairing preferences and found significant biases beyond gene usage frequencies, possibly due to receptor editing favoring less autoreactive combinations. Analyzing the available antibody structures from the Protein Data Bank, we studied conserved contact residues between heavy and light chains, particularly interactions between the CDR3 region of one chain and the FWR2 region of the opposite chain. Examination of amino acid pairs at key contact sites revealed significant deviations of amino acids distributions compared to random pairings, in the heavy chain’s CDR3 region contacting the opposite chain, indicating specific interactions might be crucial for proper chain pairing. This observation is further reinforced by preferential IGHV-IGLJ and IGLV-IGHJ pairing preferences. We hope that both our resources and the findings would contribute to improving the engineering of biological drugs. We make the database accessible at https://naturalantibody.com/paired-ab-ngs as a valuable tool for biological and machine-learning applications.

ImmunoMatch learns and predicts cognate pairing of heavy and light immunoglobulin chains

Article Open access 18 November 2025

Functional antibodies exhibit light chain coherence

Article Open access 26 October 2022

Construction of IgG–Fab² bispecific antibody via intein-mediated protein trans-splicing reaction

Article Open access 25 September 2023

Introduction

Therapeutic antibodies are rapidly emerging as the most successful class of biotherapeutics, with their yearly number of regulatory approvals reaching the same levels as the small molecule drugs¹. However, the initial generation of antibodies against a given antigen primarily requires the use of experimental methods, such as animal immunization, hybridomas, or phage/yeast display libraries².

In addition to being costly and time-consuming, experimental methods only work well for antigens that are “well-behaved” under laboratory conditions³. That is, their expression yields are high, in addition to them being conformationally stable and highly soluble in vitro. In the case of challenging targets such as GPCRs, claudins and other membrane-associated proteins that demonstrate conformational instability and/or low solubility outside of their cellular milieu, the task of producing sufficient amounts of antigens to initiate experimental generation of antibody binders becomes daunting.

To overcome these limitations and to expand the antigen space druggable by biotherapeutics to include the antigens where the experimental approaches may not work well, it is crucial to explore novel routes of antibody generation. Computation can be very helpful in this regard. This is because the availability of large antibody repertoires via next-generation sequencing (NGS) has coincided with the immense success of methods in machine learning theory and generalizations^4,5. Drawing from publicly available antibody sequencing data, either by data mining or machine learning modeling, it is now feasible to devise small and focused antigen-specific as well as larger antigen-agnostic antibody libraries^6,7. However, this endeavor requires careful construction and curation of underlying antibody sequences for deep learning, along with benchmarking different generative AI algorithms for their suitability towards specific biological purposes.

Paired antibody sequence databases are crucial for advancing our understanding of how immune responses are initiated and developed in the human body when challenged by an antigen. A crucial question is as follows: Development of IgG antibodies against an antigen requires class switching (IgM → IgG) and affinity maturation⁸. In the realm of antibody engineering, affinity maturation is typically understood to rely on processes such as V(D)J recombination and somatic hypermutation (Mishra and Mariuzza, 2018). However, distinct from the natural germinal center response, it remains an open question whether this engineered approach also benefits from the deliberate selection of specific germline pairings between the light and heavy chains⁹?. This question has been so far neglected because of lack of data available for naive B-cell and immunized antibody repertoires. Understanding this question is crucial for developing therapeutic antibodies and improving reagent and diagnostic antibodies. However, existing databases face limitations in diversity, data quality, and accessibility. The original version of OAS only contained unpaired reads¹⁰. Subsequent update collated ca. 120k paired sequences from five studies. The resource is regularly updated; however, certain studies such as one by Jaffe et al. account for most of its diversity¹¹. The presence of memory cells and the phenomenon of functional coherence can further limit the overall diversity represented in the dataset. We have recently demonstrated that by automatic mining of public repositories, it is possible to identify a large number of antibody-containing depositions¹². Though at the time, because of the technical simplicity of processing unpaired reads and the corresponding complexity of doing so with the paired datasets, we only made unpaired data available.

To address the lack of diverse paired antibody data, we created a database of human and mouse-paired antibody sequences (PairedAbNGS) derived from publicly available data in the NCBI Sequence Read Archive (SRA), focusing specifically on projects utilizing high throughput single cell analyses via the 10x Chromium 5’ pipeline. Single-cell technology has revolutionized our ability to capture and analyze individual cells, providing unprecedented insights into cellular heterogeneity and function¹³. In the context of antibody research, single-cell RNA sequencing (scRNA-seq) enables the simultaneous capture of paired heavy and light chain sequences from individual B-cells, overcoming the limitations of bulk sequencing approaches¹⁴. Previously, barcoded technology microfluidic methods for single-B-cell isolation led to the identification of millions of paired human antibody sequences, as demonstrated in studies by Rajan et al.¹⁵ and Wang et al.¹⁶. However, the actual number of unique sequences recovered may be influenced by factors such as the diversity of the B-cell population being studied, the efficiency of the isolation and sequencing processes, and the depth of sequencing coverage. The 10x Chromium 5’ pipeline, in particular, offers high-throughput single-cell analysis with improved full-length transcript coverage, making it an ideal platform for generating paired antibody sequences. By leveraging this technology and curating data from multiple studies, our database aims to provide the clearest landscape to date on the natural pairing of heavy and light chains among antibodies.

It is known that heavy/light preferences affect the binding of the antibodies to antigens^17,18, with multiple computational studies on the smaller datasets. Via such studies certain genetic preferences, as well as key residues on the interface affecting the VH/VL conformation were proposed^{19,20,21,22,23}. In addition, heavy and light chain pairings belonging to IGHV3 and IGKV1 genetic loci are most frequent among the marketed antibody-based biotherapeutics.

Leveraging the largest paired antibody collection to date, we have studied the pairing preferences at the gene and individual residue levels. We explored inter-chain interaction patterns, where we observed that the CDR3 of each chain is in contact with the FWR2 of the opposing chain. We hope that our analysis and the dataset will act as a good basis for future studies of pairing preferences and will aid in antibody engineering.

Results

Overview of database curation

NCBI SRA served as the primary source of sequencing data, with metadata imported from EBI ENA in May 2024. We focused on projects containing sequencing experiments labeled as “TRANSCRIPTOMIC SINGLE CELL” and used a large language model (Gemini) to identify antibody-related studies²⁴. This process involved creating a specific prompt for the Gemma language model²⁵ and manually verifying the results. Only projects using the 10x Chromium 5’ library preparation and focusing on human or mouse data were included. We identified 58 suitable studies containing 2482 sequencing experiments, resulting in over 14 million assembled productive sequences. Reads were processed using sra-toolkit and TRUST4²⁶, with annotation performed using the RIOT²⁷ program with OGRDB²⁸ as a primary source of gene references. The extracted data were divided by barcodes, and only productive (without the stop codon in translation and frameshift between V and J gene segments) chains were retained as annotated by RIOT flag “productive”. Sequences were sorted by coverage, and only the top heavy and light chain pair for each barcode was kept. In most cases, allelic exclusion ensures a single productive heavy and light chain, but literature shows that allelic exclusion fails in about 1% of cells, occasionally leading to dual expression—and in rare cases, even more than two chains per locus²⁹. Because there is no universally agreed-upon best approach for prioritizing among these assembled contigs (each criterion has its own biases), we chose a straightforward, reproducible strategy of selecting the productive chain with the highest average coverage for each cell. We note that an alternative approach is to use the number of unique molecular identifiers (UMIs) for each contig, which can also be a viable way to identify the chain most strongly supported by the data. However, like coverage-based selection, UMI-based approaches can still be influenced by technical and biological artifacts. To accommodate alternative heuristics, we will make the raw assembled chains available upon request, enabling downstream users to apply their own selection strategies if desired. Details of the procedure are laid out in the Methods section.

Database statistics

The database consists of 58 studies with 2482 single cell sequencing experiments totaling 14,401,268 productive assembled nucleotide sequences for both heavy and light chains. This yields 7,200,634 million nucleotide sequence pairs which translates to 6888961 unique nucleotides and 3666056 unique amino acid sequence pairs. The cumulative number of antibody-focused single cell projects deposited in SRA³⁰ presents notable growth (Supplementary Fig. 1). Project identification was performed in May 2024. The high rate of growth in the number of bioprojects speaks in favor of an automated identification pipeline, as developed in this and our previous study¹².

The overall ratio of lambda and kappa chains in human samples closely aligns with the established consensus ratios: 60% kappa and 40% lambda chains, as shown in Supplementary Table 1³¹. However, project-specific variations are evident in Supplementary Table 2, which details the proportions for each individual project. In rodents, there is a significant bias toward kappa chains, with observed proportions being 95% kappa and 5% lambda. Interestingly, our dataset reveals that the overall kappa to lambda ratio in mice is 90% kappa and 10% lambda (Supplementary Table 1), which is not as expected. This bias could be attributed to the PRJNA1082281 project, which focuses on extrafollicular plasmablasts, and where identified lambda chains constitute 33% of all light sequences. A closer examination of the mouse projects in Supplementary Table 4 shows that most adhere to the canonical 95:5 kappa to lambda ratio. Grouped metrics of chain proportions, such as mean and standard deviation, are presented in Supplementary Table 3 and Supplementary Table 5 for human and mouse projects, respectively.

As described by Collins and Jackson³¹, the proportion of kappa to lambda light chains varies between the species and might be dependent on the organism’s size. While the low lambda usage in rodents could be partly due to limited combinatorial diversity as only four possible rearrangements of lambda V and J genes, it is not the only reason. Germline usage frequencies are highly variable and dependent on factors such as gene accessibility and their positions within the genome. Moreover, secondary rearrangements reinforce kappa gene usage.

We manually curated metadata for each bioproject, focusing on human and mouse datasets (Fig. 1). Human sequences make up approximately 90% of the dataset (Fig. 1A), reflecting their prevalence in antibody studies. Using the RIOT program, we assigned constant region (C gene) allotypes to the antibody sequences by aligning their constant regions to known human genes. For heavy chains, the majority were assigned as IGHM, followed by IGHG1 and IGHA1, with about one million sequences remaining unassigned (Fig. 1B). For light chains, most sequences were assigned as IGKC, highlighting the prevalence of kappa light chains, while lambda chain assignments were split among IGLC2, IGLC1, and IGLC3; approximately 0.5 million sequences were unassigned (Fig. 1C).

**Fig. 1: Metadata associated with the 58 paired heavy/light bioprojects.**

Analyzing tissue sources, we found that the largest number of sequences originated from peripheral blood mononuclear cells (PBMCs), closely followed by peripheral blood samples (Fig. 1D). Significant numbers also came from lymph nodes (about 1 million sequences), bone marrow (0.7 million), and the spleen (0.3 million), with other tissues contributing roughly 0.6 million sequences. Regarding B-cell types, most sequences (roughly 6 millions) were unassigned; among the assigned, plasmablasts constituted the largest group, followed by plasma cells, reflecting the diversity of tissue sources and potential challenges in accurate cell type annotation (Fig. 1E). Lastly, when examining disease associations, the majority of sequences were unassigned; among the assigned sequences, those from healthy individuals were most prevalent, followed by sequences from COVID-19 patients, with smaller contributions from cancer patients and cases of allergic fungal rhinosinusitis (Fig. 1F).

High-level exploration of our data indicates that it is diverse in terms of sequence counts per bioprojects, and it does not have debilitating biases in terms of metadata associated with it. Using the opportunity of such a large dataset, we conducted an exploratory analysis of the pairing preferences. We focused on two aspects that were explored previously, germline-level and residue-level heavy-light pairing preferences.

Estimating the total number of antibody pairs

Given our large paired antibody sample, we aimed to estimate the total number of paired antibody sequences, similar to the way this is done to estimate the number of species on the basis of a smaller sample. This was previously done for the unpaired datasets by Briney et al.³² using the Chao2 estimator³³. Briefly, the Chao2 estimator is a non-parametric method commonly used in ecology to estimate species richness, particularly effective for datasets where many species are rare or unobserved. It accounts for unseen diversity by considering the frequency of rare occurrences—in our case, antibodies observed only once or twice.

For this estimation, we considered concatenated sequences of variable regions only from heavy and light chains, originating from human studies. With this approach, we extracted the total number of unique heavy-light paired variable regions sequences in all studies to be 3,391,926 (S_observed). Similarly the extracted number of unique sequences occurring in a single study is equal to 3,390,785 (q₁), and the number of sequences occurring twice equal to 1139 (q₂). Using the Chao2 estimator, we calculated the possible number of unique antibodies to be around 5 billion using Eq. 1.

$${S}_{{chao}2}={S}_{{observed}}+\frac{{q}_{1}^{2}}{2{q}_{2}}$$

(1)

On one hand, this appears to be a gross underestimate, given the known theoretical diversity of the variable regions. While this estimate highlights significant diversity within our dataset, it is lower than some other sources that suggest a higher magnitude of antibody variability of 10¹⁶ to 10¹⁸ ³². This underestimation could be a lower bound estimate of the richness as it has been reported for Chao2³⁴. Moreover, a small sampling ratio could influence the estimated results.

On the other hand, it is known that though there is great theoretical variability of the antibody repertoires, there is evidence that the vast sequence space might be constrained in the structural and sampling fashion^12,35,36,37. In particular, the variable regions of natural antibodies secreted by B-cells in the human body need to be significantly charged to assure their solubility, stability, and long circulation times^38,39. Furthermore, constant regions of the monoclonal antibodies are highly charged, and likewise, charged variable regions help prevent antibody aggregation⁴⁰. This could be a potential explanation of how such a vast sequence space for antibodies is efficiently restricted by our immune system to a smaller functional repertoire. As pointed out by Collins and Watson⁴¹, the human light chain repertoire exhibits a surprising lack of diversity due to several constraining factors. A primary reason is the strong bias in light chain gene usage. In the kappa light chain (IGK), just six IGKV sequences dominate reported rearrangements. Similarly, in the lambda light chain (IGLV), only three IGLV genes account for over 50% of human rearrangements, and the utilization of the four functional human IGLJ genes is notably biased. Additionally, the absence of D genes in light chain rearrangements further limits diversity.

With more paired sequences that appear to be in a steady supply (Supplementary Fig. 1), we will be able to see whether the diversity reaches a plateau. At this stage, we would interpret the results from Chao2 as informative but not conclusive.

Germline encoded pairing preferences

In humans and mice, when the initial rearrangement of the kappa light chain gene is unproductive or generates a self-reactive B-cell receptor, the B-cell can engage in additional rounds of secondary rearrangement through a process called receptor editing⁴¹. This mechanism enables the B-cell to modify its receptor specificity by replacing the problematic light chain, thereby resolving auto-reactivity. We investigated the presence of pairing preferences between the variable (V) germline genes of heavy and light chains, analyzing kappa and lambda chains separately. Germline usage stems from its accessibility and position in the genome. While some chain pairings arise due to germline usage frequencies in the dataset, we expect that receptor editing introduces a bias. Consequently, we anticipate that highly prevalent heavy-light chain pairs will exhibit lower auto-reactivity.

Firstly, we performed a Chi-squared test on the contingency table for the entire dataset, where columns and rows represented heavy and light V genes, respectively. This test confirmed significant pairing preferences in the dataset for both kappa and lambda chains (p « 0.01). To further investigate such preferences, we constructed contingency tables for each pair of heavy-light V gene subgroups. The null hypothesis was that H/L pair frequency is the same as the frequencies of H and L in the whole dataset. A detailed description can be found in the “Methods” section. This approach allowed us to assess pairing preferences at the V gene subgroup level by comparing their distributions to the remaining pairs in the dataset, following methodologies from previous studies⁴². The results are displayed as heatmaps of counts for kappa and lambda chains (Supplementary Fig. 2) as well as heatmaps of preferential pairings, representing for which gene pairs, the null hypothesis was rejected so the pairing frequencies surpasses the germline usage frequencies (Fig. 2). Because we are looking at individual pairs, we applied Bonferroni correction.

**Fig. 2: V gene subgroup pairing preferences.**

Altogether, we note that there are visible biases in the counts of certain pairs (Supplementary Fig. 2), and there exists a mixed picture of pairs that exist beyond random likelihood, and those that appear purely by chance. These findings indicate that while some pairings result from gene distribution patterns, certain pairing preferences do indeed exist.

To evaluate whether our database provides a unique and enriched perspective on germline gene usage compared to existing public repositories, we conducted a comparative analysis with the OAS paired database. Employing the chi-square contingency test, we tested the null hypothesis that the germline assignment ratios are identical between the datasets. The analysis yielded a chi-square statistic of 138,899 with 55 degrees of freedom for the human heavy chain V genes, resulting in a P-value much smaller than 0.001. Similarly, comparing kappa and lambda chains yielded p-values close to zero, with chi-square of 19,965 and 17,588 and with 33 and 39 degrees of freedom, respectively. This highly significant result indicates that the proportions of germline gene usage differ substantially between the two datasets.

Consequently, our database significantly enriches the diversity of germline genes represented, offering a broader and more comprehensive perspective of antibody germline usage than currently available in other databases. Following the Chao2 estimator findings, it is also likely true that our current dataset has intrinsic biases, and as more paired antibody datasets come in, the natural germline pairing distributions among human and murine antibodies may be refined further to contain fewer biases.

How naturally-sourced pairs compare to clinical-stage monoclonal antibodies

Comparing naturally sourced datasets such as OAS and PairedAbNGS assumes an intrinsically similar, natural sampling distribution. This is unlikely the case for monoclonal antibodies, where one can expect that the pairing preferences reflect engineering practices rather than natural preferences. While the lead sequence optimization does not involve disturbing the light and heavy chain pairing of the parent clone for the fear of losing potency, some germlines are preferred due to their favorable liability profiles⁴³.

To assess whether germline preferences in monoclonal antibodies align with those observed in natural repertoires, we plotted a heatmap that counted the usages of heavy and light chain genes in monoclonal antibodies (Fig. 3). This analysis uncovered a (known) strong bias in monoclonal antibody design toward the IGHV3-23 and IGHV1-46 germlines. The IGH3-23 contains a motif in the V-domain that is capable of protein A binding⁴⁴. This preference is likely attributed to its prominence in observed repertoires and its favorable liability profile⁴³, making it a popular choice in therapeutic antibody development.

**Fig. 3: Heatmap of V gene subgroups in monoclonal antibodies.**

Another evident bias is the preference for Kappa light chains (Fig. 3). This is expected, as many therapeutics are derived from murine parental clones, via humanization, where Kappa chains constitute about 95% of the light chains. While Lambda chains have been reported to have poorer developability profiles because of higher average hydrophobicity⁴⁵, this is a misconception. Only VL3 germline antibodies contain an aggregation prone region (APR) in the light chain FR2-CDR2 region which leads to aggregation, and can be easily fixed^46,47. Furthermore, antibodies containing Kappa chains may be more prone to self-reactivity as it was suggested by Hedda Wardemann⁴⁸. Moreover, Lambda chains may provide stability during an ongoing immune response, as their codon usage reduces the likelihood of structural changes arising from accumulating somatic point mutations⁴¹.

Overlooking lambda light chains in therapeutic antibody development can be counter-productive as it limits our ability to find potential binders against a target in the early stages and therefore the lead molecule choices of the antibody-based treatments. Adapting humanization techniques to include lambda light chains can mitigate the existing bias toward kappa chains in therapeutic antibody development, potentially leading to therapeutics with better efficacy, safety, and stability profiles that are more representative of the human immune response.

Structural analysis of contact residues on the heavy-light interface

At a finer granularity beyond germline pairing preferences, we explored residue contacts between antibody heavy and light chains as they are essential for understanding the molecular basis of their pairing specificity and overall function⁴⁹. By analyzing these interactions, we studied key residues that contribute to the stability and conformation of antibodies, which have implications for antigen binding and therapeutic antibody design.

We explored antibody structures, deposited in RCSB to identify contact residues responsible for VH/VL interfaces. A similar study was carried out by Raybould et al. using Random Forests⁴⁹. To explore those contact residues, the key interface residues, we imported antibody structures from the RCSB Protein Data Bank and extracted residues between the chains based on a distance threshold of 4.5 Å between any pair of heavy atoms from the heavy and light chains. Heavy atoms are defined as all atoms except the hydrogen. We present how often those residues are in contact (Fig. 4). Heatmap of these contacts was plotted to visualize interaction patterns (Fig. 5), and barplot/histograms for each IMGT position were generated to identify the type and frequency of conserved contact residues for heavy and light chains (Fig. 6).

**Fig. 4: Relative frequencies of residues to be in contact in RCSB Protein Data Bank.**

**Fig. 5: RCSB derived IMGT contact residues for heavy and light chains.**

**Fig. 6: Amino acid distributions for key interface residues.**

Such proximal residues might affect VH/VL packing and thus affect binding affinity to the target. Notably, we observed that the CDR3 of each chain is in contact with the FWR2 of the opposing chain. This suggests that while the FWR2 region of one chain does not typically interact with antigens, it may influence the conformation of the CDR3 loop from the other, thereby affecting antigen binding. Influence of FW2 on the conformation of CDRs is a well-known issue in humanization. Additionally, the interaction between CDR3 loops contributes to chain pairing specificity, underscoring the importance of these regions in antibody structure and function. Our findings are consistent with previous studies that have reported similar interactions^{18,19,20,21,23,50} using smaller datasets.

We compared our key interface residues to Vernier residues as defined by Foote and Winter⁵¹ for heavy and light chains (Fig. 7). Vernier residues were translated to IMGT using IMGT lookup. We observed very small overlap between the positions.

**Fig. 7: Key interface residues compared to Vernier residues in the IMGT numbering scheme.**

Building upon the key interface residues extracted from the PDB, we analyzed their amino acid distribution in our large paired sequence database. For each key residue on either heavy or light chain, we plotted the distributions of amino acids on key interface residues (Fig. 6C, D). As the distributions qualitatively appeared similar, we quantified the difference between them using Total Variation Distance, defined as half of the sum of proportions of each amino acid, as defined by Eq. 2:

$${D}_{{IMGT}}=\frac{1}{2}{\sum}_{{aa}}|{P}_{{ds}1}\left({aa}\right)-{P}_{{ds}2}({aa})|$$

(2)

Here, the distance D_IMGT, calculated for specific IMGT position, measures how the distributions of amino acids are different at that position between the datasets. P gives the proportions of a given amino acid (aa) from a datasets (e.g. PairedAbNGS, PDB). Therefore, the sum of all absolute values of differences between the proportions P in the datasets measures the absolute difference between their proportions. Please note the sum of the proportions is one at any position, for any dataset, normalizing for biases of different numbers of certain residues at different positions (Fig. 8).

**Fig. 8: Total Variation Distances for each key residue between amino acid distributions.**

We calculated such measures between PairedAbNGS and PDB (Fig. 8A), therapeutics (Fig. 8B), and OAS (Fig. 8C). Most of the values are below 0.2, which can be roughly interpreted as 80% similarity between the amino acid distributions. This similarity indicates that the contact residues identified are conserved across both structural, therapeutic, and sequence data, reinforcing their significance in contributing to heavy-light chain pairing specificity. The consistent patterns suggest that these residues play a pivotal role in maintaining the structural integrity and functional conformation of antibodies.

To further examine whether amino acid distributions on key interface residues differ between databases, we performed a series of goodness of fit chi-square tests comparing our database, monoclonal antibodies, Paired OAS, and RCSB PDB. For each pair of databases (PairedAbNGS vs other), for each key residue identified in the RCSB, we constructed contingency tables where columns represented distinct amino acids. The null hypothesis was that the distribution of amino acids on specific positions is the same; therefore, the expected value for each amino acid was calculated as the proportion of particular AA on this position in our database, multiplied by the number of data points on the same position from the database compared. Categories with less than 5 observed amino acids were binned to the “other” category, and the results were Bonferroni corrected. While histograms with Total Variation Distances revealed moderate and high similarities in some residues distributions, in the majority of cases the null hypothesis was rejected with p ~ 0 (with the exception of highly conserved residues).

In particular, when comparing key residues between OAS and PairedAbNGS, there were no IMGT positions that showed the same distributions according to Chi square tests. This means that while highly similar (Fig. 8), PairedAbNGS extends the diversity landscape of the available data, not only in germline gene usage but also in sequence variability. The observed difference underscores that PairedAbNGS data can reshape our understanding of antibody diversity, with downstream implications for modeling, engineering, and applying immunoglobulin repertoires in both research and clinical contexts.

We investigated the inter-chain amino acid pairs of the previously identified pairs of key interface residues to gain deeper insights into their contributions to heavy-light chain pairing specificity. For each pair of IMGT positions corresponding to these key residues on the heavy and light chains, we calculated the Total Variation Distance of amino acid pairs. The Total Variation Distance served as a measure of the difference between the distribution of amino acid pairs in our dataset and the distribution expected if heavy and light chains were randomly paired. By plotting these results as a heatmap (Fig. 9), we visualized the extent of these differences across all residue pairs. Generally, the differences were small, likely due to random pairings being a very poor proxy for ‘incorrect pairing’ as they occasionally replicated correct pairings present in the actual dataset. However, we observed notable differences reaching up to 0.06 in Total Variation Distance values, indicating that the observed amino acid pairings differ significantly from random expectations at certain positions. In particular, the majority of significant differences were attributed to the CDR3 region of the heavy chain, reinforcing its critical role in chain pairing specificity. This finding suggests that specific amino acid interactions involving the heavy chain CDR3 region are key determinants in the proper pairing of antibody chains. This aligns with work of Monica L. Fernández-Quintero et al.⁵² who found that specific residues within the CDR-H3 and CDR-L3 loops can determine pairing preferences.

**Fig. 9: Total Variation Distances for each pair of key residues between amino acid distributions.**

Having identified the proximal contact of CDR3 and FWR2 from the pairing chains (Fig. 5), we explored the pairing preferences of inter-chain V and J genes (Fig. 10). Similar to Supplementary Fig. 2, we observe a strong bias towards some pairings (p-value ~ 0) which also might indicate pairing preferences arising from receptor editing in which autoreactive antibodies are removed. It is important to note, however, that while we observe a notable bias in J gene usage towards IGHJ4, it can also be attributed to recombination signal sequence (RSS) as it was investigated by Bin Shi et al.⁵³.

**Fig. 10: Inter-chain V-J germline pairing frequencies.**

Discussion

We introduce PairedAbNGS, which to our knowledge, is the largest available dataset of paired heavy and light chain antibody sequences. This extensive dataset serves as a valuable resource to the community for several applications including repertoire analyses, biologic drug discovery, and machine learning applications in antibody design and engineering. By providing a comprehensive collection of paired sequences, PairedAbNGS enables Machine Learning models to learn the intricate relationships between heavy and light chains, leading to more accurate functional antibody predictions and the development of more effective antibody libraries with higher sensitivity and specificity along with desirable physicochemical attributes.

We examined pairing preferences between the variable (V) genes of heavy and light chains and identified significant biases toward certain pairs that go beyond gene usage frequencies alone. Using Chi-square contingency tests, we confirmed that specific heavy-light chain combinations are favored, likely due to receptor editing favoring combinations of stable molecules, with lower autoreactivity.

Exploring the key regions and residues at the heavy and light chain interfaces, we noted that the framework region 2 (FWR2) from one chain frequently contacts the complementarity-determining region 3 (CDR3) of the pairing chain. This observation led us to investigate pairing preferences for VH-JL (heavy chain variable region and light chain joining region) and VL-JH (light chain variable region and heavy chain joining region) gene segments. We identified statistically significant biases in these pairings, indicating that specific inter-chain V and J gene combinations contribute to stability and functionality of the antibody molecule.

These biases have important implications for antibody engineering, particularly in the humanization of monoclonal antibodies and the better utilization of lambda chains, which are reported to have lower autoreactivity. Heavy and light chain pairing is crucial for maintaining the correct antibody structure and function. Therefore during humanization, it is essential to consider not only the CDRs but also these key interface residues that contribute to chain pairing specificity and stability. Understanding these pairing preferences and inter-chain interactions can inform strategies to minimize autoreactivity and improve the efficacy of therapeutic antibodies. Our findings highlight the importance of considering heavy-light chain pairing biases and interfacial residues in antibody design and engineering.

As we focused in our analysis on static structures deposited in RCSB, it is important to notice that CDR3 loops are flexible and their lengths are dependent on VH, VL, and JH gene usage⁵⁴. It is therefore important that the pairing preferences, the nature of inter-chain contacts, and its impact on antigen recognition should be investigated using conformational ensembles obtained from molecular dynamics simulations as it was introduced by Monica L Fernández-Quintero et al. The authors showed that varying light and heavy pairings impacts CDR conformations. This leads to changes in heavy/light chain orientations which further impacts antibody structure^55,56.

Considering the potential of transcriptomic analysis as a future extension of our database could significantly enhance our understanding of antibody development and maturation. Some of the bioprojects within our database contain sequencing experiments that allow for the determination of protein expression levels. Integrating transcriptomic data would enable us to identify biomarkers relevant to antibody maturation and immune responses, which could be instrumental in discovering antigen-specific antibodies directly. Although we did not incorporate this analysis in the current study due to incompatibilities among data sources, future inclusion of transcriptomic information could provide deeper insights into the regulatory mechanisms governing antibody expression. This expansion would not only enrich the database but also offer a valuable resource for researchers aiming to explore the interplay between antibody structure, expression levels, and immune functionality.

Methods

Identification of bioprojects

NCBI SRA⁵⁷ was used as the primary source of the sequences. Metadata for all bioprojects was imported from European Bioinformatics Institute European Nucleotide Archive (EBI ENA) portal⁵⁸.

We extracted sequencing runs from NCBI SRA metadata where the library source was defined as “TRANSCRIPTOMIC SINGLE CELL”. For each unique bioproject, we imported its description from EBI ENA. In order to identify antibody-related projects, we constructed an LLM prompt and submitted it to the Gemma language model. Finally, the remaining projects were manually reviewed to discard studies incorrectly flagged as antibody-related. Furthermore, we applied the following inclusion criteria: 10x Chromium 5’ library, human or mouse, known data layout.

Prompt engineering⁵⁹ is an evolving field focused on optimizing interactions with large language models to achieve specific outcomes. It involves crafting precise prompts to guide LLMs in generating accurate and context-appropriate responses, utilizing techniques such as chain of thought and emotion prompts to enhance problem-solving and emotional intelligence. Additionally, few-shot learning and self-reflection prompts are used to help LLMs quickly adapt to new tasks and improve the quality of their outputs. We have applied the above-mentioned prompt engineering patterns to create the prompt given in Listing 1.

NCBI SRA (Sequence Read Archive) is a public database consisting of various NGS studies. You are trying to scan existing studies (or bioprojects) to determine if they focus on antibodies or b-cells.

You will be provided paper abstract from NCBI SRA related bioproject. Your task is to

* provide short 1-2 sentences summarization of the abstract

* determine if the project contains on antibodies or b-cells. On the top of directly mentioned b-cells and antibodies in the abstract, projects investigating immune responses or antigens or vaccines should be considered as projects focusing on antibodies or b-cells.

* justify your answer with a short explanation. think step by step

* determine if the project directly focuses on antibodies or b-cells or if it is indirectly related to them.

* provide confidence level of your answer (low, medium, high)

Please provide the response in plain json format, for example given the following abstract:

Single cell sequencing for analysis gene of gene transcription and Ig repertoires of isotype-switched IgG-expressing memory B cells for paired analysis with Ig repertoires in the same cells. Aim of the study was to explore heterogeneity of isotype-switched memory B cells within and between spleen and bone marrow compartments. Overall design: Sequencing of RNA and Ig was performed in duplicates of IgG + B cells isolated from spleen and bone marrow of C57BL/6 mice.

The response would be:

{

“summary”: “Single cell sequencing of human BCR”,

“contains_antibodies”: “True”,

“justification”: “b-cells are mentioned in the abstract …”,

“relation”: “direct”,

“confidence_level”: “high”

}

Another example:

This study investigated the immune responses induced by an NS1-deleted influenza virus vectored intranasal COVID-19 vaccine (dNS1-RBD) in C57BL/6 mice, focusing on the single-cell RNA sequencing of lung tissue at different time points post-vaccination/infection.

Response:

{

“summary”: “This study investigated the immune responses induced by an NS1-deleted influenza virus vectored intranasal COVID-19 vaccine (dNS1-RBD) in C57BL/6 mice”,

“contains_antibodies”: “True”,

“justification”: “study investigated immune responses which and b-cells and antibodies are part of immune responses”,

“relation”: “indirect”,

“confidence_level”: “high”

}

This is very important for my career, I will become homeless if you make an error.

The response should contain only json contents string, no other information should be present in the response.

Listing 1. Bioproject verification prompt

This filtered the results to ca. 200 antibody-related bioprojects. Such a number was already manageable for further manual review that discarded false positives as well as cases that were not suitable for downstream analysis (e.g. species with projects for which we do not have a germline database). We focused on projects where a 10x Chromium 5’ library preparation pipeline was used.

This way we identified 58 studies, consisting of 2482 sequencing experiments with a total over 14 million assembled productive sequences constituting more than 7 million paired heavy and light sequences.

Reads were converted to FASTQ format using sra-toolkit after which they were extracted into separate files and assembled using TRUST4. Finally assembled consensus sequences were annotated using RIOT to IMGT scheme. The OGRDB database was used as a primary source for human and mouse VDJ genes in the RIOT annotation pipeline.

Processing pipeline

For each identified project, each read run is extracted to multiple fastq files using the sra-toolkit.

We use TRUST4 script to extract the reads to separate files containing barcodes, UMIs, and biological reads, respectively. This process is customized for each study, according to the read layout imported from NCBI trace server or library construction protocol described in study related paper (if present). From the existing bioprojects, we identified 2 groups of read layouts: single—where biological sequence was present on only one read—and paired.

In a single layout, read 1 contains the cell barcode and unique molecular identifier (UMI), which are crucial for identifying the cell of origin and removing PCR duplicates, respectively. Read 2 captures the actual cDNA sequence, which corresponds to the transcribed RNA from a particular gene⁶⁰.

In paired layout, the length of read 1 is extended to cover both the cell barcode and a portion of the transcript sequence, a modification referred to as “10 × 5’ with extended R1”⁶⁰. This approach can increase the number of unambiguous reads spanning important junctions, such as those found in viral subgenomic mRNAs, potentially improving the accuracy and depth of single-cell transcriptome analysis.

Distinct versions of 10x chemistry have different lengths of 10x barcodes and UMIs. While in single read layout, the length of the UMIs and barcodes is determined by read length, in paired reads we inferred it by localizing conserved Template Switch Oligo (TSO) on read 1.

Extracted reads are repartitioned by barcode to multiple directories to improve the parallelisation of the calculations, and TRUST4 is used again to assemble consensus sequences for each barcode. Finally, consensus sequences are annotated using RIOT.

Germline-encoded pairing preferences

To explore pairing preference between V germlines of heavy and light chains for kappa and lambda chains separately, we grouped sequences by the V gene subgroup, according to germlines assigned by RIOT. Since OGRDB mouse germlines do not have a straightforward group/subgroup/allele hierarchy, we focused only on human projects. We constructed a contingency table containing counts of V gene subgroups frequencies for heavy and light chains which was an input to Chi-square test. Pairs with observed co-occurrence of less than 5 were filtered out.

To investigate pairing preferences at higher granularities, for each pair of hl, we constructed the following contingency table:

$${hl\,h}^{\prime} l$$

$${hl}^{\prime} \,h^{\prime} l^{\prime}$$

where ${hl}$ is the count of occurrences of heavy($h$) and light($l$) V gene subgroups in the dataset, and $h^{\prime}$ and $l^{\prime}$ represents is the number of all other V genes subgroups, except $h$ and $l$ respectively. The null hypothesis is that the frequency of ${hl}$ stems from the frequencies of $h$ and $l$ in the dataset, therefore, expected values arises from proportion of ${hl}$, $h^{\prime} l$, ${hl}^{\prime}$ and $h^{\prime} l^{\prime}$. Significance levels were adjusted with Bonferroni correction, by dividing them by the number of tests performed.

IMGT key interface residues

We imported protein structures from the RCSB Protein Data Bank and numbered them to the IMGT scheme to identify antibody sequences. To accurately pair heavy and light chains, we measured the alpha carbon (Cα) distances of cysteines located at IMGT position 104 in both chains, pairing those within the distance of 20 Å, due to their conserved disulfide bonds. We extracted contact residues between the chains based on a criterion where any pair of heavy atoms (excluding hydrogens) from the heavy and light chains were within a maximum distance of 4.5 Å.

OAS database preprocessing

The OAS dataset was preprocessed using RIOT to ensure consistent and comparable germline V gene assignments across both databases. We grouped the assigned V genes into their respective subgroups and constructed a contingency table where each row represented a V gene subgroup and each column represented one of the two datasets.

Therapeutic database preprocessing

Manually collected therapeutic data was filtered to retain only standard monoclonal antibody formats therapeutics (monospecific antibodies with two pairs of heavy and light sequences). Since many therapeutic molecules originate from rodents with varying levels of humanization, sequences were annotated using RIOT using organism detection mode, so that the closest germline can be assigned, regardless of the level of mouse sequence content.

Statistics and reproducibility

To investigate the germline pairing preferences at high level, Chi-squared contingency tests were performed using the chi2_contingency method from scipy.stats python package. Only human sequences were used, with the number of occurrences greater or equal to 5. This yielded a total n = 2607209 and n = 3958735 heavy—lambda and heavy—kappa chain pairs respectively.

The same filtering criteria were applied to investigate the pairing preferences on higher granularity - only human sequences with at least 5 observed sequences in the dataset were considered. The number of occurrences of each gene pair in the dataset is available in the Supplementary Data 1.xlsx file.

To compare the pairing distribution of PairedAbNGS to OAS we performed a chi-squared test, comparing the distribution of the frequencies of V gene subgroups in both databases. The frequencies can be obtained by grouping the pairing frequencies available in the Supplementary Data 1.xlsx file.

To compare the distributions of amino acids on key contact residues between the databases we performed a series of chi-squared tests. In cases where there were less than 5 occurrences of given amino acid at specific IMGT position, the counts were binned to the “other” category. The sample sizes for each test can be easily derived from the residues in contact frequencies from the Supplementary Data 1.xlsx file.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

PairedAbNGS is available from https://naturalantibody.com/paired-ab-ngs/ for non-commercial use by non-commercial organizations. Numerical data used to generate images is available in the supplemental Excel file (Supplementary Data 1.xlsx).

Change history

28 August 2025
In this article the affiliation details for Author Ponraj Prabakaran were incorrectly given as '3Institute for Protein Innovation, Boston, USA' but should have been '3Molecule design and modeling group, Platform Research, Moderna, Cambridge, MA, USA and 4Present Address: Institute for Protein Innovation, Boston, USA.'. The original article has been corrected.

References

Senior, M. Fresh from the biotech pipeline: record-breaking FDA approvals. Nat. Biotechnol. 42, 355–361 (2024).
Article CAS PubMed Google Scholar
Gray, A. et al. Animal-free alternatives and the antibody iceberg. Nat. Biotechnol. 38, 1234–1239 (2020).
Article CAS PubMed Google Scholar
Stephens, A. D. & Wilkinson, T. Discovery of therapeutic antibodies targeting complex multi-spanning membrane proteins. BioDrugs 38, 769–794 (2024).
Article CAS PubMed PubMed Central Google Scholar
Vidyasagar, M. A Theory of Learning and Generalization: With Applications to Neural Networks and Control Systems. (Springer, 1996).
Wilman, W. et al. Machine-designed biotherapeutics: opportunities, feasibility and advantages of deep learning in computational antibody discovery. Brief. Bioinform. 23, bbac267 (2022).
Bauer, J. et al. How can we discover developable antibody-based biotherapeutics? Front. Mol. Biosci. 10, 1221626 (2023).
Article CAS PubMed PubMed Central Google Scholar
Porebski, B. T. et al. Rapid discovery of high-affinity antibodies via massively parallel sequencing, ribosome display and affinity screening. Nat. Biomed. Eng. 8, 214–232 (2024).
Article CAS PubMed Google Scholar
Stavnezer, J. & Schrader, C. E. IgH chain class switch recombination: mechanism and regulation. J. Immunol. 193, 5370–5378 (2014).
Article CAS PubMed Google Scholar
Enzelberger, M., Prassler, J., Urlinger, S., Herrmann, T. & Tiller, T. A collection of VH and VL pairs having favourable biophysical properties and methods for its use. Patent (2016).
Kovaltsuk, A. et al. Observed antibody space: a resource for data mining next-generation sequencing of antibody repertoires. J. Immunol. 201, 2502–2509 (2018).
Article CAS PubMed Google Scholar
Jaffe, D. B. et al. Functional antibodies exhibit light chain coherence. Nature 611, 352–357 (2022).
Article CAS PubMed PubMed Central Google Scholar
Dudzic, P. et al. Large-scale data mining of four billion human antibody variable regions reveals convergence between therapeutic and natural antibodies that constrains search space for biologics drug discovery. MAbs 16, 2361928 (2024).
Article PubMed PubMed Central Google Scholar
Pearson, H. C. L. et al. The promise of single-cell technology in providing new insights into the molecular heterogeneity and management of acute lymphoblastic leukemia. Hemasphere 6, e734 (2022).
Article CAS PubMed PubMed Central Google Scholar
Zhang, R. et al. A platform-agnostic, function first-based antibody discovery strategy using plasmid-free mammalian expression of antibodies. MAbs 13, 1904546 (2021).
Article PubMed PubMed Central Google Scholar
Rajan, S. et al. Recombinant human B cell repertoires enable screening for rare, specific, and natively paired antibodies. Commun. Biol. 1, 5 (2018).
Article PubMed PubMed Central Google Scholar
Wang, B. et al. Functional interrogation and mining of natively paired human VH:VL antibody repertoires. Nat. Biotechnol. 36, 152–155 (2018).
Article PubMed PubMed Central Google Scholar
Dondelinger, M. et al. Understanding the significance and implications of antibody numbering and antigen-binding surface/residue definition. Front. Immunol. 9, 2278 (2018).
Article PubMed PubMed Central Google Scholar
Fernández-Quintero, M. L. et al. Antibodies exhibit multiple paratope states influencing VH-VL domain orientations. Commun. Biol. 3, 589 (2020).
Article PubMed PubMed Central Google Scholar
Abhinandan, K. R. & Martin, A. C. R. Analysis and prediction of VH/VL packing in antibodies. Protein Eng. Des. Sel. 23, 689–697 (2010).
Article CAS PubMed Google Scholar
Dunbar, J., Fuchs, A., Shi, J. & Deane, C. M. ABangle: characterising the VH-VL orientation in antibodies. Protein Eng. Des. Sel. 26, 611–620 (2013).
Article CAS PubMed Google Scholar
Bujotzek, A. et al. Prediction of VH-VL domain orientation for antibody variable domain modeling: Prediction of VH-VL domain orientation. Proteins 83, 681–695 (2015).
Article CAS PubMed Google Scholar
Bujotzek, A. et al. VH-VL orientation prediction for antibody humanization candidate selection: a case study. MAbs 8, 288–305 (2016).
Article CAS PubMed Google Scholar
Boron, V. A. & Martin, A. C. R. abYpap: improvements to the prediction of antibody VH/VL packing using gradient boosted regression. Protein Eng. Des. Sel. 36, gzad021 (2023).
Article PubMed PubMed Central Google Scholar
Gemini Team et al. Gemini: A family of highly capable multimodal models. arXiv [cs.CL] (2023).
Gemma Team et al. Gemma: open models based on Gemini research and technology. arXiv [cs.CL] (2024).
Song, L. et al. TRUST4: immune repertoire reconstruction from bulk and single-cell RNA-seq data. Nat. Methods 18, 627–630 (2021).
Article CAS PubMed PubMed Central Google Scholar
Dudzic, P. et al. RIOT—Rapid Immunoglobulin Overview Tool - annotation of nucleotide and amino acid immunoglobulin sequences using an open germline database. Brief. Bioinform. 26, bbae632 (2024).
Lees, W. et al. OGRDB: a reference database of inferred immune receptor genes. Nucleic Acids Res. 48, D964–D970 (2020).
Article CAS PubMed Google Scholar
Ralph, D. K. & Matsen, F. A. 4th Inference of B cell clonal families using heavy/light chain pairing information. PLoS Comput. Biol. 18, e1010723 (2022).
Article CAS PubMed PubMed Central Google Scholar
Sayers, E. W. et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 50, D20–D26 (2022).
Article CAS PubMed Google Scholar
Collins, A. M. & Jackson, K. J. L. On being the right size: antibody repertoire formation in the mouse and human. Immunogenetics 70, 143–158 (2018).
Article CAS PubMed Google Scholar
Briney, B., Inderbitzin, A., Joyce, C. & Burton, D. R. Commonality despite exceptional diversity in the baseline human antibody repertoire. Nature 566, 393–397 (2019).
Article PubMed PubMed Central Google Scholar
Gotelli, N. J. & Colwell, R. K. Estimating species richness. in 12 39–54 (unknown, 2011).
Chiu, C.-H. A species richness estimator for sample-based incidence data sampled without replacement. Methods Ecol. Evol. 14, 2482–2493 (2023).
Article Google Scholar
Rees, A. R. Understanding the human antibody repertoire. MAbs 12, 1729683 (2020).
Article PubMed PubMed Central Google Scholar
Krawczyk, K., Raybould, M. I. J., Kovaltsuk, A. & Deane, C. M. Looking for therapeutic antibodies in next-generation sequencing repositories. MAbs 11, 1197–1205 (2019).
Article PubMed PubMed Central Google Scholar
Krawczyk, K. et al. Structurally mapping antibody repertoires. Front. Immunol. 9, 1698 (2018).
Article PubMed PubMed Central Google Scholar
Reddy, S. T. et al. Monoclonal antibodies isolated without screening by analyzing the variable-gene repertoire of plasma cells. Nat. Biotechnol. 28, 965–969 (2010).
Article CAS PubMed Google Scholar
Lee, C.-H. et al. An engineered human Fc domain that behaves like a pH-toggle switch for ultra-long circulation persistence. Nat. Commun. 10, 5031 (2019).
Article PubMed PubMed Central Google Scholar
Li, L. et al. Concentration dependent viscosity of monoclonal antibody solutions: explaining experimental behavior in terms of molecular properties. Pharm. Res. 31, 3161–3178 (2014).
Article CAS PubMed Google Scholar
Collins, A. M. & Watson, C. T. Immunoglobulin light chain gene rearrangements, receptor editing and the development of a self-tolerant antibody repertoire. Front. Immunol. 9, 2249 (2018).
Article PubMed PubMed Central Google Scholar
Jayaram, N., Bhowmick, P. & Martin, A. C. R. Germline VH/VL pairing in antibodies. Protein Eng. Des. Sel. 25, 523–529 (2012).
Article CAS PubMed Google Scholar
Satława, T. et al. LAP: Liability Antibody Profiler by sequence & structural mapping of natural and therapeutic antibodies. PLoS Comput. Biol. 20, e1011881 (2024).
Article PubMed PubMed Central Google Scholar
Graille, M. et al. Crystal structure of a Staphylococcus aureus protein A domain complexed with the Fab fragment of a human IgM antibody: structural basis for recognition of B-cell receptors and superantigen activity. Proc. Natl. Acad. Sci. USA 97, 5399–5404 (2000).
Article CAS PubMed PubMed Central Google Scholar
Raybould, M. I. J., Turnbull, O. M., Suter, A., Guloglu, B. & Deane, C. M. Contextualising the developability risk of antibodies with lambda light chains using enhanced therapeutic antibody profiling. Commun. Biol. 7, 62 (2024).
Article CAS PubMed PubMed Central Google Scholar
Nichols, P. et al. Rational design of viscosity reducing mutants of a monoclonal antibody: hydrophobic versus electrostatic inter-molecular interactions. MAbs 7, 212–230 (2015).
Article CAS PubMed PubMed Central Google Scholar
Kumar, S. et al. Rational optimization of a monoclonal antibody for simultaneous improvements in its solution properties and biological activity. Protein Eng. Des. Sel. 31, 313–325 (2018).
Article CAS PubMed Google Scholar
Wardemann, H., Hammersen, J. & Nussenzweig, M. C. Human autoantibody silencing by immunoglobulin light chains. J. Exp. Med. 200, 191–199 (2004).
Article CAS PubMed PubMed Central Google Scholar
Raybould, M. I. J. et al. Public Baseline and shared response structures support the theory of antibody repertoire functional commonality. PLoS Comput. Biol. 17, e1008781 (2021).
Article CAS PubMed PubMed Central Google Scholar
Cisneros, A. et al. Role of antibody heavy and light chain interface residues in affinity maturation of binding to HIV envelope glycoprotein. Mol. Syst. Des. Eng. 4, 737–746 (2019).
Article CAS PubMed PubMed Central Google Scholar
Foote, J. & Winter, G. Antibody framework residues affecting the conformation of the hypervariable loops. J. Mol. Biol. 224, 487–499 (1992).
Article CAS PubMed Google Scholar
Fernández-Quintero, M. L. et al. CDR loop interactions can determine heavy and light chain pairing preferences in bispecific antibodies. MAbs 14, 2024118 (2022).
Article PubMed PubMed Central Google Scholar
Shi, B. et al. The usage of human IGHJ genes follows a particular non-random selection: The recombination signal sequence may affect the usage of human IGHJ genes. Front. Genet. 11, 524413 (2020).
Article CAS PubMed PubMed Central Google Scholar
Sankar, K., Hoi, K. H. & Hötzel, I. Dynamics of heavy chain junctional length biases in antibody repertoires. Commun. Biol. 3, 207 (2020).
Article CAS PubMed PubMed Central Google Scholar
Fernández-Quintero, M. L., Georges, G., Varga, J. M. & Liedl, K. R. Ensembles in solution as a new paradigm for antibody structure prediction and design. MAbs 13, 1923122 (2021).
Article PubMed PubMed Central Google Scholar
Fernández-Quintero, M. L. et al. Germline-dependent antibody paratope states and pairing specific VH-VL interface dynamics. Front. Immunol. 12, 675655 (2021).
Article PubMed PubMed Central Google Scholar
Leinonen, R., Sugawara, H. & Shumway, M.International Nucleotide Sequence Database Collaboration The sequence read archive. Nucleic Acids Res. 39, D19–D21 (2011).
Article CAS PubMed Google Scholar
Cummins, C. et al. The European Nucleotide Archive in 2021. Nucleic Acids Res. 50, D106–D110 (2022).
Article CAS PubMed Google Scholar
White, J. et al. A prompt pattern catalog to enhance prompt engineering with ChatGPT. arXiv [cs.SE] (2023).
Cohen, P. et al. Unambiguous detection of SARS-CoV-2 subgenomic mRNAs with single-cell RNA sequencing. Microbiol. Spectr. 11, e0077623 (2023).
Article PubMed Google Scholar

Download references

Author information

Authors and Affiliations

NaturalAntibody, Szczecin, Poland
Pawel Dudzic, Dawid Chomicz, Weronika Bielska, Igor Jaszczyszyn, Michał Zieliński, Bartosz Janusz, Sonia Wróbel & Konrad Krawczyk
Sanofi, Babraham Research Campus, Cambridge, UK
Marguerite-Marie Le Pannérer & Andrew Philips
Molecule design and modeling group, Platform Research, Moderna, Cambridge, MA, USA
Ponraj Prabakaran & Sandeep Kumar
Institute for Protein Innovation, Boston, USA
Ponraj Prabakaran

Authors

Pawel Dudzic
View author publications
Search author on:PubMed Google Scholar
Dawid Chomicz
View author publications
Search author on:PubMed Google Scholar
Weronika Bielska
View author publications
Search author on:PubMed Google Scholar
Igor Jaszczyszyn
View author publications
Search author on:PubMed Google Scholar
Michał Zieliński
View author publications
Search author on:PubMed Google Scholar
Bartosz Janusz
View author publications
Search author on:PubMed Google Scholar
Sonia Wróbel
View author publications
Search author on:PubMed Google Scholar
Marguerite-Marie Le Pannérer
View author publications
Search author on:PubMed Google Scholar
Andrew Philips
View author publications
Search author on:PubMed Google Scholar
Ponraj Prabakaran
View author publications
Search author on:PubMed Google Scholar
Sandeep Kumar
View author publications
Search author on:PubMed Google Scholar
Konrad Krawczyk
View author publications
Search author on:PubMed Google Scholar

Contributions

P.D. imported sequencing data, constructed a PoC pipeline to assemble raw reads and annotate them using RIOT, performed analyses, and wrote the initial version of the manuscript. D.C. Imported RCSB data, deployed PoC pipeline to cloud infrastructure. W.B., I.J. and M.Z. annotated sequencing metadata and verified outcomes from language models. M.-M.L.P. provided technical support for single cell read layout configurations in various projects, reviewed the manuscript, and provided constructive comments. B.J., S.W. helped with data interpretation, reviewed the manuscript, and provided constructive comments. A.P., P.P., S.K. reviewed the manuscript and provided constructive comments. K.K. Supervised work, helped with data interpretation, reviewed the manuscript, and provided constructive comments.

Corresponding authors

Correspondence to Pawel Dudzic or Konrad Krawczyk.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Communications Biology thanks Timothy Jenkins and the other anonymous reviewer(s) for their contribution to the peer review of this work. Primary Handling Editors: Theam Soon Lim and Laura Rodriguez-Perez.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Description of Additional Supplementary files

Supplementary Data 1

Reporting Summary

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Dudzic, P., Chomicz, D., Bielska, W. et al. Conserved heavy/light contacts and germline preferences revealed by a large-scale analysis of natively paired human antibody sequences and structural data. Commun Biol 8, 1110 (2025). https://doi.org/10.1038/s42003-025-08388-y

Download citation

Received: 04 January 2025
Accepted: 13 June 2025
Published: 26 July 2025
Version of record: 26 July 2025
DOI: https://doi.org/10.1038/s42003-025-08388-y