Introduction

Therapeutic antibodies are rapidly emerging as the most successful class of biotherapeutics, with their yearly number of regulatory approvals reaching the same levels as the small molecule drugs1. However, the initial generation of antibodies against a given antigen primarily requires the use of experimental methods, such as animal immunization, hybridomas, or phage/yeast display libraries2.

In addition to being costly and time-consuming, experimental methods only work well for antigens that are “well-behaved” under laboratory conditions3. That is, their expression yields are high, in addition to them being conformationally stable and highly soluble in vitro. In the case of challenging targets such as GPCRs, claudins and other membrane-associated proteins that demonstrate conformational instability and/or low solubility outside of their cellular milieu, the task of producing sufficient amounts of antigens to initiate experimental generation of antibody binders becomes daunting.

To overcome these limitations and to expand the antigen space druggable by biotherapeutics to include the antigens where the experimental approaches may not work well, it is crucial to explore novel routes of antibody generation. Computation can be very helpful in this regard. This is because the availability of large antibody repertoires via next-generation sequencing (NGS) has coincided with the immense success of methods in machine learning theory and generalizations4,5. Drawing from publicly available antibody sequencing data, either by data mining or machine learning modeling, it is now feasible to devise small and focused antigen-specific as well as larger antigen-agnostic antibody libraries6,7. However, this endeavor requires careful construction and curation of underlying antibody sequences for deep learning, along with benchmarking different generative AI algorithms for their suitability towards specific biological purposes.

Paired antibody sequence databases are crucial for advancing our understanding of how immune responses are initiated and developed in the human body when challenged by an antigen. A crucial question is as follows: Development of IgG antibodies against an antigen requires class switching (IgM → IgG) and affinity maturation8. In the realm of antibody engineering, affinity maturation is typically understood to rely on processes such as V(D)J recombination and somatic hypermutation (Mishra and Mariuzza, 2018). However, distinct from the natural germinal center response, it remains an open question whether this engineered approach also benefits from the deliberate selection of specific germline pairings between the light and heavy chains9?. This question has been so far neglected because of lack of data available for naive B-cell and immunized antibody repertoires. Understanding this question is crucial for developing therapeutic antibodies and improving reagent and diagnostic antibodies. However, existing databases face limitations in diversity, data quality, and accessibility. The original version of OAS only contained unpaired reads10. Subsequent update collated ca. 120k paired sequences from five studies. The resource is regularly updated; however, certain studies such as one by Jaffe et al. account for most of its diversity11. The presence of memory cells and the phenomenon of functional coherence can further limit the overall diversity represented in the dataset. We have recently demonstrated that by automatic mining of public repositories, it is possible to identify a large number of antibody-containing depositions12. Though at the time, because of the technical simplicity of processing unpaired reads and the corresponding complexity of doing so with the paired datasets, we only made unpaired data available.

To address the lack of diverse paired antibody data, we created a database of human and mouse-paired antibody sequences (PairedAbNGS) derived from publicly available data in the NCBI Sequence Read Archive (SRA), focusing specifically on projects utilizing high throughput single cell analyses via the 10x Chromium 5’ pipeline. Single-cell technology has revolutionized our ability to capture and analyze individual cells, providing unprecedented insights into cellular heterogeneity and function13. In the context of antibody research, single-cell RNA sequencing (scRNA-seq) enables the simultaneous capture of paired heavy and light chain sequences from individual B-cells, overcoming the limitations of bulk sequencing approaches14. Previously, barcoded technology microfluidic methods for single-B-cell isolation led to the identification of millions of paired human antibody sequences, as demonstrated in studies by Rajan et al.15 and Wang et al.16. However, the actual number of unique sequences recovered may be influenced by factors such as the diversity of the B-cell population being studied, the efficiency of the isolation and sequencing processes, and the depth of sequencing coverage. The 10x Chromium 5’ pipeline, in particular, offers high-throughput single-cell analysis with improved full-length transcript coverage, making it an ideal platform for generating paired antibody sequences. By leveraging this technology and curating data from multiple studies, our database aims to provide the clearest landscape to date on the natural pairing of heavy and light chains among antibodies.

It is known that heavy/light preferences affect the binding of the antibodies to antigens17,18, with multiple computational studies on the smaller datasets. Via such studies certain genetic preferences, as well as key residues on the interface affecting the VH/VL conformation were proposed19,20,21,22,23. In addition, heavy and light chain pairings belonging to IGHV3 and IGKV1 genetic loci are most frequent among the marketed antibody-based biotherapeutics.

Leveraging the largest paired antibody collection to date, we have studied the pairing preferences at the gene and individual residue levels. We explored inter-chain interaction patterns, where we observed that the CDR3 of each chain is in contact with the FWR2 of the opposing chain. We hope that our analysis and the dataset will act as a good basis for future studies of pairing preferences and will aid in antibody engineering.

Results

Overview of database curation

NCBI SRA served as the primary source of sequencing data, with metadata imported from EBI ENA in May 2024. We focused on projects containing sequencing experiments labeled as “TRANSCRIPTOMIC SINGLE CELL” and used a large language model (Gemini) to identify antibody-related studies24. This process involved creating a specific prompt for the Gemma language model25 and manually verifying the results. Only projects using the 10x Chromium 5’ library preparation and focusing on human or mouse data were included. We identified 58 suitable studies containing 2482 sequencing experiments, resulting in over 14 million assembled productive sequences. Reads were processed using sra-toolkit and TRUST426, with annotation performed using the RIOT27 program with OGRDB28 as a primary source of gene references. The extracted data were divided by barcodes, and only productive (without the stop codon in translation and frameshift between V and J gene segments) chains were retained as annotated by RIOT flag “productive”. Sequences were sorted by coverage, and only the top heavy and light chain pair for each barcode was kept. In most cases, allelic exclusion ensures a single productive heavy and light chain, but literature shows that allelic exclusion fails in about 1% of cells, occasionally leading to dual expression—and in rare cases, even more than two chains per locus29. Because there is no universally agreed-upon best approach for prioritizing among these assembled contigs (each criterion has its own biases), we chose a straightforward, reproducible strategy of selecting the productive chain with the highest average coverage for each cell. We note that an alternative approach is to use the number of unique molecular identifiers (UMIs) for each contig, which can also be a viable way to identify the chain most strongly supported by the data. However, like coverage-based selection, UMI-based approaches can still be influenced by technical and biological artifacts. To accommodate alternative heuristics, we will make the raw assembled chains available upon request, enabling downstream users to apply their own selection strategies if desired. Details of the procedure are laid out in the Methods section.

Database statistics

The database consists of 58 studies with 2482 single cell sequencing experiments totaling 14,401,268 productive assembled nucleotide sequences for both heavy and light chains. This yields 7,200,634 million nucleotide sequence pairs which translates to 6888961 unique nucleotides and 3666056 unique amino acid sequence pairs. The cumulative number of antibody-focused single cell projects deposited in SRA30 presents notable growth (Supplementary Fig. 1). Project identification was performed in May 2024. The high rate of growth in the number of bioprojects speaks in favor of an automated identification pipeline, as developed in this and our previous study12.

The overall ratio of lambda and kappa chains in human samples closely aligns with the established consensus ratios: 60% kappa and 40% lambda chains, as shown in Supplementary Table 131. However, project-specific variations are evident in Supplementary Table 2, which details the proportions for each individual project. In rodents, there is a significant bias toward kappa chains, with observed proportions being 95% kappa and 5% lambda. Interestingly, our dataset reveals that the overall kappa to lambda ratio in mice is 90% kappa and 10% lambda (Supplementary Table 1), which is not as expected. This bias could be attributed to the PRJNA1082281 project, which focuses on extrafollicular plasmablasts, and where identified lambda chains constitute 33% of all light sequences. A closer examination of the mouse projects in Supplementary Table 4 shows that most adhere to the canonical 95:5 kappa to lambda ratio. Grouped metrics of chain proportions, such as mean and standard deviation, are presented in Supplementary Table 3 and Supplementary Table 5 for human and mouse projects, respectively.

As described by Collins and Jackson31, the proportion of kappa to lambda light chains varies between the species and might be dependent on the organism’s size. While the low lambda usage in rodents could be partly due to limited combinatorial diversity as only four possible rearrangements of lambda V and J genes, it is not the only reason. Germline usage frequencies are highly variable and dependent on factors such as gene accessibility and their positions within the genome. Moreover, secondary rearrangements reinforce kappa gene usage.

We manually curated metadata for each bioproject, focusing on human and mouse datasets (Fig. 1). Human sequences make up approximately 90% of the dataset (Fig. 1A), reflecting their prevalence in antibody studies. Using the RIOT program, we assigned constant region (C gene) allotypes to the antibody sequences by aligning their constant regions to known human genes. For heavy chains, the majority were assigned as IGHM, followed by IGHG1 and IGHA1, with about one million sequences remaining unassigned (Fig. 1B). For light chains, most sequences were assigned as IGKC, highlighting the prevalence of kappa light chains, while lambda chain assignments were split among IGLC2, IGLC1, and IGLC3; approximately 0.5 million sequences were unassigned (Fig. 1C).

Fig. 1: Metadata associated with the 58 paired heavy/light bioprojects.
figure 1

A Sequences count stratified by organisms. B Heavy chain allotypes assigned by C region alignment. C C genes count assigned by C region alignment. D Sequences stratified by source tissue. E B cell types extracted from metadata. F Associated diseases.

Analyzing tissue sources, we found that the largest number of sequences originated from peripheral blood mononuclear cells (PBMCs), closely followed by peripheral blood samples (Fig. 1D). Significant numbers also came from lymph nodes (about 1 million sequences), bone marrow (0.7 million), and the spleen (0.3 million), with other tissues contributing roughly 0.6 million sequences. Regarding B-cell types, most sequences (roughly 6 millions) were unassigned; among the assigned, plasmablasts constituted the largest group, followed by plasma cells, reflecting the diversity of tissue sources and potential challenges in accurate cell type annotation (Fig. 1E). Lastly, when examining disease associations, the majority of sequences were unassigned; among the assigned sequences, those from healthy individuals were most prevalent, followed by sequences from COVID-19 patients, with smaller contributions from cancer patients and cases of allergic fungal rhinosinusitis (Fig. 1F).

High-level exploration of our data indicates that it is diverse in terms of sequence counts per bioprojects, and it does not have debilitating biases in terms of metadata associated with it. Using the opportunity of such a large dataset, we conducted an exploratory analysis of the pairing preferences. We focused on two aspects that were explored previously, germline-level and residue-level heavy-light pairing preferences.

Estimating the total number of antibody pairs

Given our large paired antibody sample, we aimed to estimate the total number of paired antibody sequences, similar to the way this is done to estimate the number of species on the basis of a smaller sample. This was previously done for the unpaired datasets by Briney et al.32 using the Chao2 estimator33. Briefly, the Chao2 estimator is a non-parametric method commonly used in ecology to estimate species richness, particularly effective for datasets where many species are rare or unobserved. It accounts for unseen diversity by considering the frequency of rare occurrences—in our case, antibodies observed only once or twice.

For this estimation, we considered concatenated sequences of variable regions only from heavy and light chains, originating from human studies. With this approach, we extracted the total number of unique heavy-light paired variable regions sequences in all studies to be 3,391,926 (Sobserved). Similarly the extracted number of unique sequences occurring in a single study is equal to 3,390,785 (q1), and the number of sequences occurring twice equal to 1139 (q2). Using the Chao2 estimator, we calculated the possible number of unique antibodies to be around 5 billion using Eq. 1.

$${S}_{{chao}2}={S}_{{observed}}+\frac{{q}_{1}^{2}}{2{q}_{2}}$$
(1)

On one hand, this appears to be a gross underestimate, given the known theoretical diversity of the variable regions. While this estimate highlights significant diversity within our dataset, it is lower than some other sources that suggest a higher magnitude of antibody variability of 1016 to 1018 32. This underestimation could be a lower bound estimate of the richness as it has been reported for Chao234. Moreover, a small sampling ratio could influence the estimated results.

On the other hand, it is known that though there is great theoretical variability of the antibody repertoires, there is evidence that the vast sequence space might be constrained in the structural and sampling fashion12,35,36,37. In particular, the variable regions of natural antibodies secreted by B-cells in the human body need to be significantly charged to assure their solubility, stability, and long circulation times38,39. Furthermore, constant regions of the monoclonal antibodies are highly charged, and likewise, charged variable regions help prevent antibody aggregation40. This could be a potential explanation of how such a vast sequence space for antibodies is efficiently restricted by our immune system to a smaller functional repertoire. As pointed out by Collins and Watson41, the human light chain repertoire exhibits a surprising lack of diversity due to several constraining factors. A primary reason is the strong bias in light chain gene usage. In the kappa light chain (IGK), just six IGKV sequences dominate reported rearrangements. Similarly, in the lambda light chain (IGLV), only three IGLV genes account for over 50% of human rearrangements, and the utilization of the four functional human IGLJ genes is notably biased. Additionally, the absence of D genes in light chain rearrangements further limits diversity.

With more paired sequences that appear to be in a steady supply (Supplementary Fig. 1), we will be able to see whether the diversity reaches a plateau. At this stage, we would interpret the results from Chao2 as informative but not conclusive.

Germline encoded pairing preferences

In humans and mice, when the initial rearrangement of the kappa light chain gene is unproductive or generates a self-reactive B-cell receptor, the B-cell can engage in additional rounds of secondary rearrangement through a process called receptor editing41. This mechanism enables the B-cell to modify its receptor specificity by replacing the problematic light chain, thereby resolving auto-reactivity. We investigated the presence of pairing preferences between the variable (V) germline genes of heavy and light chains, analyzing kappa and lambda chains separately. Germline usage stems from its accessibility and position in the genome. While some chain pairings arise due to germline usage frequencies in the dataset, we expect that receptor editing introduces a bias. Consequently, we anticipate that highly prevalent heavy-light chain pairs will exhibit lower auto-reactivity.

Firstly, we performed a Chi-squared test on the contingency table for the entire dataset, where columns and rows represented heavy and light V genes, respectively. This test confirmed significant pairing preferences in the dataset for both kappa and lambda chains (p « 0.01). To further investigate such preferences, we constructed contingency tables for each pair of heavy-light V gene subgroups. The null hypothesis was that H/L pair frequency is the same as the frequencies of H and L in the whole dataset. A detailed description can be found in the “Methods” section. This approach allowed us to assess pairing preferences at the V gene subgroup level by comparing their distributions to the remaining pairs in the dataset, following methodologies from previous studies42. The results are displayed as heatmaps of counts for kappa and lambda chains (Supplementary Fig. 2) as well as heatmaps of preferential pairings, representing for which gene pairs, the null hypothesis was rejected so the pairing frequencies surpasses the germline usage frequencies (Fig. 2). Because we are looking at individual pairs, we applied Bonferroni correction.

Fig. 2: V gene subgroup pairing preferences.
figure 2

A Lambda chains. B Kappa chains. Heatmaps represent if the frequencies of individual pairs of genes are the same as in the whole dataset—whether the null hypothesis was rejected or not.

Altogether, we note that there are visible biases in the counts of certain pairs (Supplementary Fig. 2), and there exists a mixed picture of pairs that exist beyond random likelihood, and those that appear purely by chance. These findings indicate that while some pairings result from gene distribution patterns, certain pairing preferences do indeed exist.

To evaluate whether our database provides a unique and enriched perspective on germline gene usage compared to existing public repositories, we conducted a comparative analysis with the OAS paired database. Employing the chi-square contingency test, we tested the null hypothesis that the germline assignment ratios are identical between the datasets. The analysis yielded a chi-square statistic of 138,899 with 55 degrees of freedom for the human heavy chain V genes, resulting in a P-value much smaller than 0.001. Similarly, comparing kappa and lambda chains yielded p-values close to zero, with chi-square of 19,965 and 17,588 and with 33 and 39 degrees of freedom, respectively. This highly significant result indicates that the proportions of germline gene usage differ substantially between the two datasets.

Consequently, our database significantly enriches the diversity of germline genes represented, offering a broader and more comprehensive perspective of antibody germline usage than currently available in other databases. Following the Chao2 estimator findings, it is also likely true that our current dataset has intrinsic biases, and as more paired antibody datasets come in, the natural germline pairing distributions among human and murine antibodies may be refined further to contain fewer biases.

How naturally-sourced pairs compare to clinical-stage monoclonal antibodies

Comparing naturally sourced datasets such as OAS and PairedAbNGS assumes an intrinsically similar, natural sampling distribution. This is unlikely the case for monoclonal antibodies, where one can expect that the pairing preferences reflect engineering practices rather than natural preferences. While the lead sequence optimization does not involve disturbing the light and heavy chain pairing of the parent clone for the fear of losing potency, some germlines are preferred due to their favorable liability profiles43.

To assess whether germline preferences in monoclonal antibodies align with those observed in natural repertoires, we plotted a heatmap that counted the usages of heavy and light chain genes in monoclonal antibodies (Fig. 3). This analysis uncovered a (known) strong bias in monoclonal antibody design toward the IGHV3-23 and IGHV1-46 germlines. The IGH3-23 contains a motif in the V-domain that is capable of protein A binding44. This preference is likely attributed to its prominence in observed repertoires and its favorable liability profile43, making it a popular choice in therapeutic antibody development.

Fig. 3: Heatmap of V gene subgroups in monoclonal antibodies.
figure 3

The counts represents the number of antibodies sharing the gene pair.

Another evident bias is the preference for Kappa light chains (Fig. 3). This is expected, as many therapeutics are derived from murine parental clones, via humanization, where Kappa chains constitute about 95% of the light chains. While Lambda chains have been reported to have poorer developability profiles because of higher average hydrophobicity45, this is a misconception. Only VL3 germline antibodies contain an aggregation prone region (APR) in the light chain FR2-CDR2 region which leads to aggregation, and can be easily fixed46,47. Furthermore, antibodies containing Kappa chains may be more prone to self-reactivity as it was suggested by Hedda Wardemann48. Moreover, Lambda chains may provide stability during an ongoing immune response, as their codon usage reduces the likelihood of structural changes arising from accumulating somatic point mutations41.

Overlooking lambda light chains in therapeutic antibody development can be counter-productive as it limits our ability to find potential binders against a target in the early stages and therefore the lead molecule choices of the antibody-based treatments. Adapting humanization techniques to include lambda light chains can mitigate the existing bias toward kappa chains in therapeutic antibody development, potentially leading to therapeutics with better efficacy, safety, and stability profiles that are more representative of the human immune response.

Structural analysis of contact residues on the heavy-light interface

At a finer granularity beyond germline pairing preferences, we explored residue contacts between antibody heavy and light chains as they are essential for understanding the molecular basis of their pairing specificity and overall function49. By analyzing these interactions, we studied key residues that contribute to the stability and conformation of antibodies, which have implications for antigen binding and therapeutic antibody design.

We explored antibody structures, deposited in RCSB to identify contact residues responsible for VH/VL interfaces. A similar study was carried out by Raybould et al. using Random Forests49. To explore those contact residues, the key interface residues, we imported antibody structures from the RCSB Protein Data Bank and extracted residues between the chains based on a distance threshold of 4.5 Å between any pair of heavy atoms from the heavy and light chains. Heavy atoms are defined as all atoms except the hydrogen. We present how often those residues are in contact (Fig. 4). Heatmap of these contacts was plotted to visualize interaction patterns (Fig. 5), and barplot/histograms for each IMGT position were generated to identify the type and frequency of conserved contact residues for heavy and light chains (Fig. 6).

Fig. 4: Relative frequencies of residues to be in contact in RCSB Protein Data Bank.
figure 4

For each IMGT ID, we calculated how often its heavy atoms are within 4.5 Å of the other antibody chain. Please note that certain IMGT IDs have higher occurrences, and the percentage here is taken as the number of contacts for a given IMGT ID out of the total number of times the residue occurs in the PDB. A Ratio of contacts for each IMGT ID for heavy chains. B Ratio of contacts for each IMGT ID for light chains.

Fig. 5: RCSB derived IMGT contact residues for heavy and light chains.
figure 5

A heatmap of counts of contacting residues between heavy and light chains as defined by 4.5 Å distance threshold. B Slope chart presenting inter-chain contacts.

Fig. 6: Amino acid distributions for key interface residues.
figure 6

For each key interface residue (<4.5 Å heavy atom distance in the PDB), we plot how often the position is seen in a given dataset (y-axis count), and on the bar, we color by proportion of amino acid types at that position. A Heavy chains from PDB. B Light chains from RCSB PDB. C Heavy chains from PairedAbNGS. D Light chains from PairedAbNGS.

Such proximal residues might affect VH/VL packing and thus affect binding affinity to the target. Notably, we observed that the CDR3 of each chain is in contact with the FWR2 of the opposing chain. This suggests that while the FWR2 region of one chain does not typically interact with antigens, it may influence the conformation of the CDR3 loop from the other, thereby affecting antigen binding. Influence of FW2 on the conformation of CDRs is a well-known issue in humanization. Additionally, the interaction between CDR3 loops contributes to chain pairing specificity, underscoring the importance of these regions in antibody structure and function. Our findings are consistent with previous studies that have reported similar interactions18,19,20,21,23,50 using smaller datasets.

We compared our key interface residues to Vernier residues as defined by Foote and Winter51 for heavy and light chains (Fig. 7). Vernier residues were translated to IMGT using IMGT lookup. We observed very small overlap between the positions.

Fig. 7: Key interface residues compared to Vernier residues in the IMGT numbering scheme.
figure 7

A Heavy chain. B Light chain.

Building upon the key interface residues extracted from the PDB, we analyzed their amino acid distribution in our large paired sequence database. For each key residue on either heavy or light chain, we plotted the distributions of amino acids on key interface residues (Fig. 6C, D). As the distributions qualitatively appeared similar, we quantified the difference between them using Total Variation Distance, defined as half of the sum of proportions of each amino acid, as defined by Eq. 2:

$${D}_{{IMGT}}=\frac{1}{2}{\sum}_{{aa}}|{P}_{{ds}1}\left({aa}\right)-{P}_{{ds}2}({aa})|$$
(2)

Here, the distance DIMGT, calculated for specific IMGT position, measures how the distributions of amino acids are different at that position between the datasets. P gives the proportions of a given amino acid (aa) from a datasets (e.g. PairedAbNGS, PDB). Therefore, the sum of all absolute values of differences between the proportions P in the datasets measures the absolute difference between their proportions. Please note the sum of the proportions is one at any position, for any dataset, normalizing for biases of different numbers of certain residues at different positions (Fig. 8).

Fig. 8: Total Variation Distances for each key residue between amino acid distributions.
figure 8

The closer the value is to zero, the closer the distributions. A PairedAbNGS vs PDB. B PairedAbNGS vs therapeutics. C PairedAbNGS vs OAS.

We calculated such measures between PairedAbNGS and PDB (Fig. 8A), therapeutics (Fig. 8B), and OAS (Fig. 8C). Most of the values are below 0.2, which can be roughly interpreted as 80% similarity between the amino acid distributions. This similarity indicates that the contact residues identified are conserved across both structural, therapeutic, and sequence data, reinforcing their significance in contributing to heavy-light chain pairing specificity. The consistent patterns suggest that these residues play a pivotal role in maintaining the structural integrity and functional conformation of antibodies.

To further examine whether amino acid distributions on key interface residues differ between databases, we performed a series of goodness of fit chi-square tests comparing our database, monoclonal antibodies, Paired OAS, and RCSB PDB. For each pair of databases (PairedAbNGS vs other), for each key residue identified in the RCSB, we constructed contingency tables where columns represented distinct amino acids. The null hypothesis was that the distribution of amino acids on specific positions is the same; therefore, the expected value for each amino acid was calculated as the proportion of particular AA on this position in our database, multiplied by the number of data points on the same position from the database compared. Categories with less than 5 observed amino acids were binned to the “other” category, and the results were Bonferroni corrected. While histograms with Total Variation Distances revealed moderate and high similarities in some residues distributions, in the majority of cases the null hypothesis was rejected with p ~ 0 (with the exception of highly conserved residues).

In particular, when comparing key residues between OAS and PairedAbNGS, there were no IMGT positions that showed the same distributions according to Chi square tests. This means that while highly similar (Fig. 8), PairedAbNGS extends the diversity landscape of the available data, not only in germline gene usage but also in sequence variability. The observed difference underscores that PairedAbNGS data can reshape our understanding of antibody diversity, with downstream implications for modeling, engineering, and applying immunoglobulin repertoires in both research and clinical contexts.

We investigated the inter-chain amino acid pairs of the previously identified pairs of key interface residues to gain deeper insights into their contributions to heavy-light chain pairing specificity. For each pair of IMGT positions corresponding to these key residues on the heavy and light chains, we calculated the Total Variation Distance of amino acid pairs. The Total Variation Distance served as a measure of the difference between the distribution of amino acid pairs in our dataset and the distribution expected if heavy and light chains were randomly paired. By plotting these results as a heatmap (Fig. 9), we visualized the extent of these differences across all residue pairs. Generally, the differences were small, likely due to random pairings being a very poor proxy for ‘incorrect pairing’ as they occasionally replicated correct pairings present in the actual dataset. However, we observed notable differences reaching up to 0.06 in Total Variation Distance values, indicating that the observed amino acid pairings differ significantly from random expectations at certain positions. In particular, the majority of significant differences were attributed to the CDR3 region of the heavy chain, reinforcing its critical role in chain pairing specificity. This finding suggests that specific amino acid interactions involving the heavy chain CDR3 region are key determinants in the proper pairing of antibody chains. This aligns with work of Monica L. Fernández-Quintero et al.52 who found that specific residues within the CDR-H3 and CDR-L3 loops can determine pairing preferences.

Fig. 9: Total Variation Distances for each pair of key residues between amino acid distributions.
figure 9

The closer the value is to zero, the closer the distributions. Comparing the distributions of cognate amino acids between sequence pairs from the database and randomly paired sequences.

Having identified the proximal contact of CDR3 and FWR2 from the pairing chains (Fig. 5), we explored the pairing preferences of inter-chain V and J genes (Fig. 10). Similar to Supplementary Fig. 2, we observe a strong bias towards some pairings (p-value ~ 0) which also might indicate pairing preferences arising from receptor editing in which autoreactive antibodies are removed. It is important to note, however, that while we observe a notable bias in J gene usage towards IGHJ4, it can also be attributed to recombination signal sequence (RSS) as it was investigated by Bin Shi et al.53.

Fig. 10: Inter-chain V-J germline pairing frequencies.
figure 10

The pairing preferences of inter-chain V and J genes subgroups, presented as the occurrence counts. A pairing of V genes from heavy chain with J genes from light chain. B pairing of V genes from light chain with J genes from heavy chain.

Discussion

We introduce PairedAbNGS, which to our knowledge, is the largest available dataset of paired heavy and light chain antibody sequences. This extensive dataset serves as a valuable resource to the community for several applications including repertoire analyses, biologic drug discovery, and machine learning applications in antibody design and engineering. By providing a comprehensive collection of paired sequences, PairedAbNGS enables Machine Learning models to learn the intricate relationships between heavy and light chains, leading to more accurate functional antibody predictions and the development of more effective antibody libraries with higher sensitivity and specificity along with desirable physicochemical attributes.

We examined pairing preferences between the variable (V) genes of heavy and light chains and identified significant biases toward certain pairs that go beyond gene usage frequencies alone. Using Chi-square contingency tests, we confirmed that specific heavy-light chain combinations are favored, likely due to receptor editing favoring combinations of stable molecules, with lower autoreactivity.

Exploring the key regions and residues at the heavy and light chain interfaces, we noted that the framework region 2 (FWR2) from one chain frequently contacts the complementarity-determining region 3 (CDR3) of the pairing chain. This observation led us to investigate pairing preferences for VH-JL (heavy chain variable region and light chain joining region) and VL-JH (light chain variable region and heavy chain joining region) gene segments. We identified statistically significant biases in these pairings, indicating that specific inter-chain V and J gene combinations contribute to stability and functionality of the antibody molecule.

These biases have important implications for antibody engineering, particularly in the humanization of monoclonal antibodies and the better utilization of lambda chains, which are reported to have lower autoreactivity. Heavy and light chain pairing is crucial for maintaining the correct antibody structure and function. Therefore during humanization, it is essential to consider not only the CDRs but also these key interface residues that contribute to chain pairing specificity and stability. Understanding these pairing preferences and inter-chain interactions can inform strategies to minimize autoreactivity and improve the efficacy of therapeutic antibodies. Our findings highlight the importance of considering heavy-light chain pairing biases and interfacial residues in antibody design and engineering.

As we focused in our analysis on static structures deposited in RCSB, it is important to notice that CDR3 loops are flexible and their lengths are dependent on VH, VL, and JH gene usage54. It is therefore important that the pairing preferences, the nature of inter-chain contacts, and its impact on antigen recognition should be investigated using conformational ensembles obtained from molecular dynamics simulations as it was introduced by Monica L Fernández-Quintero et al. The authors showed that varying light and heavy pairings impacts CDR conformations. This leads to changes in heavy/light chain orientations which further impacts antibody structure55,56.

Considering the potential of transcriptomic analysis as a future extension of our database could significantly enhance our understanding of antibody development and maturation. Some of the bioprojects within our database contain sequencing experiments that allow for the determination of protein expression levels. Integrating transcriptomic data would enable us to identify biomarkers relevant to antibody maturation and immune responses, which could be instrumental in discovering antigen-specific antibodies directly. Although we did not incorporate this analysis in the current study due to incompatibilities among data sources, future inclusion of transcriptomic information could provide deeper insights into the regulatory mechanisms governing antibody expression. This expansion would not only enrich the database but also offer a valuable resource for researchers aiming to explore the interplay between antibody structure, expression levels, and immune functionality.

Methods

Identification of bioprojects

NCBI SRA57 was used as the primary source of the sequences. Metadata for all bioprojects was imported from European Bioinformatics Institute European Nucleotide Archive (EBI ENA) portal58.

We extracted sequencing runs from NCBI SRA metadata where the library source was defined as “TRANSCRIPTOMIC SINGLE CELL”. For each unique bioproject, we imported its description from EBI ENA. In order to identify antibody-related projects, we constructed an LLM prompt and submitted it to the Gemma language model. Finally, the remaining projects were manually reviewed to discard studies incorrectly flagged as antibody-related. Furthermore, we applied the following inclusion criteria: 10x Chromium 5’ library, human or mouse, known data layout.

Prompt engineering59 is an evolving field focused on optimizing interactions with large language models to achieve specific outcomes. It involves crafting precise prompts to guide LLMs in generating accurate and context-appropriate responses, utilizing techniques such as chain of thought and emotion prompts to enhance problem-solving and emotional intelligence. Additionally, few-shot learning and self-reflection prompts are used to help LLMs quickly adapt to new tasks and improve the quality of their outputs. We have applied the above-mentioned prompt engineering patterns to create the prompt given in Listing 1.

NCBI SRA (Sequence Read Archive) is a public database consisting of various NGS studies. You are trying to scan existing studies (or bioprojects) to determine if they focus on antibodies or b-cells.

You will be provided paper abstract from NCBI SRA related bioproject. Your task is to

* provide short 1-2 sentences summarization of the abstract

* determine if the project contains on antibodies or b-cells. On the top of directly mentioned b-cells and antibodies in the abstract, projects investigating immune responses or antigens or vaccines should be considered as projects focusing on antibodies or b-cells.

* justify your answer with a short explanation. think step by step

* determine if the project directly focuses on antibodies or b-cells or if it is indirectly related to them.

* provide confidence level of your answer (low, medium, high)

Please provide the response in plain json format, for example given the following abstract:

Single cell sequencing for analysis gene of gene transcription and Ig repertoires of isotype-switched IgG-expressing memory B cells for paired analysis with Ig repertoires in the same cells. Aim of the study was to explore heterogeneity of isotype-switched memory B cells within and between spleen and bone marrow compartments. Overall design: Sequencing of RNA and Ig was performed in duplicates of IgG + B cells isolated from spleen and bone marrow of C57BL/6 mice.

The response would be:

{

“summary”: “Single cell sequencing of human BCR”,

“contains_antibodies”: “True”,

“justification”: “b-cells are mentioned in the abstract …”,

“relation”: “direct”,

“confidence_level”: “high”

}

Another example:

This study investigated the immune responses induced by an NS1-deleted influenza virus vectored intranasal COVID-19 vaccine (dNS1-RBD) in C57BL/6 mice, focusing on the single-cell RNA sequencing of lung tissue at different time points post-vaccination/infection.

Response:

{

“summary”: “This study investigated the immune responses induced by an NS1-deleted influenza virus vectored intranasal COVID-19 vaccine (dNS1-RBD) in C57BL/6 mice”,

“contains_antibodies”: “True”,

“justification”: “study investigated immune responses which and b-cells and antibodies are part of immune responses”,

“relation”: “indirect”,

“confidence_level”: “high”

}

This is very important for my career, I will become homeless if you make an error.

The response should contain only json contents string, no other information should be present in the response.

Listing 1. Bioproject verification prompt

This filtered the results to ca. 200 antibody-related bioprojects. Such a number was already manageable for further manual review that discarded false positives as well as cases that were not suitable for downstream analysis (e.g. species with projects for which we do not have a germline database). We focused on projects where a 10x Chromium 5’ library preparation pipeline was used.

This way we identified 58 studies, consisting of 2482 sequencing experiments with a total over 14 million assembled productive sequences constituting more than 7 million paired heavy and light sequences.

Reads were converted to FASTQ format using sra-toolkit after which they were extracted into separate files and assembled using TRUST4. Finally assembled consensus sequences were annotated using RIOT to IMGT scheme. The OGRDB database was used as a primary source for human and mouse VDJ genes in the RIOT annotation pipeline.

Processing pipeline

For each identified project, each read run is extracted to multiple fastq files using the sra-toolkit.

We use TRUST4 script to extract the reads to separate files containing barcodes, UMIs, and biological reads, respectively. This process is customized for each study, according to the read layout imported from NCBI trace server or library construction protocol described in study related paper (if present). From the existing bioprojects, we identified 2 groups of read layouts: single—where biological sequence was present on only one read—and paired.

In a single layout, read 1 contains the cell barcode and unique molecular identifier (UMI), which are crucial for identifying the cell of origin and removing PCR duplicates, respectively. Read 2 captures the actual cDNA sequence, which corresponds to the transcribed RNA from a particular gene60.

In paired layout, the length of read 1 is extended to cover both the cell barcode and a portion of the transcript sequence, a modification referred to as “10 × 5’ with extended R1”60. This approach can increase the number of unambiguous reads spanning important junctions, such as those found in viral subgenomic mRNAs, potentially improving the accuracy and depth of single-cell transcriptome analysis.

Distinct versions of 10x chemistry have different lengths of 10x barcodes and UMIs. While in single read layout, the length of the UMIs and barcodes is determined by read length, in paired reads we inferred it by localizing conserved Template Switch Oligo (TSO) on read 1.

Extracted reads are repartitioned by barcode to multiple directories to improve the parallelisation of the calculations, and TRUST4 is used again to assemble consensus sequences for each barcode. Finally, consensus sequences are annotated using RIOT.

Germline-encoded pairing preferences

To explore pairing preference between V germlines of heavy and light chains for kappa and lambda chains separately, we grouped sequences by the V gene subgroup, according to germlines assigned by RIOT. Since OGRDB mouse germlines do not have a straightforward group/subgroup/allele hierarchy, we focused only on human projects. We constructed a contingency table containing counts of V gene subgroups frequencies for heavy and light chains which was an input to Chi-square test. Pairs with observed co-occurrence of less than 5 were filtered out.

To investigate pairing preferences at higher granularities, for each pair of hl, we constructed the following contingency table:

$${hl\,h}^{\prime} l$$
$${hl}^{\prime} \,h^{\prime} l^{\prime}$$

where \({hl}\) is the count of occurrences of heavy(\(h\)) and light(\(l\)) V gene subgroups in the dataset, and \(h^{\prime}\) and \(l^{\prime}\) represents is the number of all other V genes subgroups, except \(h\) and \(l\) respectively. The null hypothesis is that the frequency of \({hl}\) stems from the frequencies of \(h\) and \(l\) in the dataset, therefore, expected values arises from proportion of \({hl}\), \(h^{\prime} l\), \({hl}^{\prime}\) and \(h^{\prime} l^{\prime}\). Significance levels were adjusted with Bonferroni correction, by dividing them by the number of tests performed.

IMGT key interface residues

We imported protein structures from the RCSB Protein Data Bank and numbered them to the IMGT scheme to identify antibody sequences. To accurately pair heavy and light chains, we measured the alpha carbon (Cα) distances of cysteines located at IMGT position 104 in both chains, pairing those within the distance of 20 Å, due to their conserved disulfide bonds. We extracted contact residues between the chains based on a criterion where any pair of heavy atoms (excluding hydrogens) from the heavy and light chains were within a maximum distance of 4.5 Å.

OAS database preprocessing

The OAS dataset was preprocessed using RIOT to ensure consistent and comparable germline V gene assignments across both databases. We grouped the assigned V genes into their respective subgroups and constructed a contingency table where each row represented a V gene subgroup and each column represented one of the two datasets.

Therapeutic database preprocessing

Manually collected therapeutic data was filtered to retain only standard monoclonal antibody formats therapeutics (monospecific antibodies with two pairs of heavy and light sequences). Since many therapeutic molecules originate from rodents with varying levels of humanization, sequences were annotated using RIOT using organism detection mode, so that the closest germline can be assigned, regardless of the level of mouse sequence content.

Statistics and reproducibility

To investigate the germline pairing preferences at high level, Chi-squared contingency tests were performed using the chi2_contingency method from scipy.stats python package. Only human sequences were used, with the number of occurrences greater or equal to 5. This yielded a total n = 2607209 and n = 3958735 heavy—lambda and heavy—kappa chain pairs respectively.

The same filtering criteria were applied to investigate the pairing preferences on higher granularity - only human sequences with at least 5 observed sequences in the dataset were considered. The number of occurrences of each gene pair in the dataset is available in the Supplementary Data 1.xlsx file.

To compare the pairing distribution of PairedAbNGS to OAS we performed a chi-squared test, comparing the distribution of the frequencies of V gene subgroups in both databases. The frequencies can be obtained by grouping the pairing frequencies available in the Supplementary Data 1.xlsx file.

To compare the distributions of amino acids on key contact residues between the databases we performed a series of chi-squared tests. In cases where there were less than 5 occurrences of given amino acid at specific IMGT position, the counts were binned to the “other” category. The sample sizes for each test can be easily derived from the residues in contact frequencies from the Supplementary Data 1.xlsx file.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.