Abstract
Antibody (Ab) structures and antibody-antigen (Ag) interactions (AAIs) are essential for understanding immune recognition and designing Ab therapeutics. While existing structural Ab databases provide valuable insights, they still face limitations in data accuracy, completeness, and/or update frequency. Here, we present SAAINT-parser, a computational workflow for rapid, accurate, and robust extraction of Ab and AAI information from the Protein Data Bank (PDB). SAAINT-parser features precise detection of Ab chains, accurate pairing of Ab chains, and reliable identification of AAIs. The resulting SAAINT-DB, last updated on May 1, 2025, contains 19,128 data entries from 9757 PDB structures, offering a comprehensive and up-to-date resource. Detailed analyses highlight the advantages of SAAINT-DB over the widely used SAbDab in terms of data accuracy and completeness. Furthermore, SAAINT-DB provides nearly twice as many non-redundant, manually curated Ab-Ag binding affinity entries as SAbDab. To support Ab-related research and benefit the broader scientific community, we provide open access to SAAINT-parser, the SAAINT-DB summary file, unprocessed PDB structures, and SAAINT-parser-processed structure models at https://github.com/tommyhuangthu/SAAINT.
Similar content being viewed by others
Introduction
Antibodies (Abs), also known as immunoglobulins (Igs), are essential immune system components, providing defense against infections and foreign substances. Their high affinity and specificity for antigens (Ags) have made them invaluable in basic research, biotechnology, and biomedical applications [1, 2]. Structural biology and biophysical studies on Ab structures and Ab-Ag interactions (AAIs) are crucial for understanding the molecular determinants of Ab specificity, binding affinity, and other properties, facilitating the rational design and optimization of Abs for therapeutic and diagnostic purposes.
Recent breakthroughs in artificial intelligence, particularly deep learning, have revolutionized protein structure prediction and design. The availability of extensive, diverse protein structural data has played a pivotal role in these advancements. Trained on a vast portion of the Protein Data Bank (PDB) [3], the Nobel Prize-winning AlphaFold [4] system, related networks [5, 6], and their successors [7, 8] outperform traditional methods in modeling Ab structures and AAIs [9, 10]. Similarly, the RFdiffusion [11] network, fine-tuned on Ab complex structures, has demonstrated the ability to accurately design de novo single-chain Abs [12].
The PDB [3] serves as the primary repository for 3D structural data of large biological molecules, including Abs. As of this writing in May 2025, the PDB contains over 235,000 entries, yet only a small fraction (~4%) contains Ab structures. Several specialized structural Ab databases [13,14,15,16,17,18,19] have been developed, integrating data from the PDB and other resources. Among them, IMGT/3Dstructure-DB [13] is one of the earliest databases dedicated to immunological proteins, including Abs. Other notable databases include BEID (B-cell Epitope Interaction Database) [14], which compiles sequence-structure-function data on AAIs, and AgAbDb (Antigen-Antibody interaction Database) [16], a knowledgebase developed to compile, curate, and analyze determinants of AAIs. PyIgClassify [17] specializes in the classification of complementarity-determining regions (CDRs), while AB-Bind [18] provides a set of binding mutational data with Ab structures for evaluating computational AAI modeling methods. AbDb (Antibody structure Database) [19] focuses on the Ab Fv (Fragment variable) regions with their conjugate Ags.
Among these, SAbDab (Structural Antibody Database) [15] stands out as the most comprehensive repository. According to its authors, SAbDab contains all the publicly available Ab structures annotated with key experimental details, heavy chain (HC)-light chain (LC) pairings, and Ag information, and manually curated Ab-Ag binding affinity data. As of its most recent update on May 2, 2025, SAbDab includes 9521 PDB structures and 18,744 data entries, surpassing other structural Ab databases described above. In comparison, IMGT/3Dstructure-DB [13], the second largest, last updated on May 23, 2024, contains 7243 PDB structures and 9053 data entries. Given its breadth and regular updates, SAbDab currently represents the most extensive structural Ab database available.
While these databases provide valuable structural data for Ab-related research, they still face certain limitations. For instance, our detailed examination of SAbDab reveals a noticeable amount of missing or incorrectly annotated data, which can hinder accurate analysis and computational modeling.
To address these issues, we present SAAINT-parser, a computational workflow designed for fast and accurate processing of PDB entries to extract structural Ab and AAI information. This has enabled the construction of SAAINT-DB, a novel and comprehensive structural Ab database. Both SAAINT-parser and SAAINT-DB are released as open-source resources, aiming at advancing Ab-related research and benefiting the broader scientific community.
Materials and methods
Data source
The mmCIF files of the whole PDB repository were downloaded to a local high-performance computing cluster following the wwPDB guidelines at https://www.wwpdb.org/ftp/pdb-ftp-sites. Each entry’s corresponding FASTA file and PDB information were retrieved from https://www.rcsb.org/fasta/entry/pdbid and https://www.rcsb.org/structure/pdbid, respectively, where pdbid represents a valid PDB identifier (ID).
FASTA and mmCIF data structure
For each PDB entry, the FASTA file contains sequences for all macromolecular entities recorded in the mmCIF file. Each entity consists of one or more chains with ID(s) labeled in label_asym_id (assigned by the PDB) and often auth_asym_id (designated by the deposition authors). All chains within an entity share the same sequence (FASTA-seq). The entity also includes a molecule name and the species from which it originates.
The mmCIF file provides atomic coordinates for both the entities recorded in FASTA and other unrecorded molecules, such as ions, cofactors, water molecules, and nonpolymeric ligands. In this work, we focused on the structures of Abs and their interactions with proteins, peptides, and nucleic acids while ignoring other molecules. Although the mmCIF file encodes all the information found in the FASTA file, both formats were used because FASTA, being simpler, enables more efficient data processing.
Identifying and labeling Ab chains
Each FASTA-seq was analyzed using AbRSA [20] to determine whether it contains an Ab chain. Based on this analysis, each FASTA-seq was categorized into one of the following types: heavy chain (HC), light chain (LC), heavy_light chain (HLC), or non-Ab. If all chains were labeled as non-Ab, the entry was deemed to lack an Ab chain and was discarded.
For PDB entries containing a single model with at least one Ab chain, each protein or peptide structure was individually extracted and saved as a separate PDB file using Biopython [21]. The corresponding primary sequence (PDB-seq), composed of residues with atomic coordinates, was then extracted. To ensure accurate sequence mapping, the PDB-seq was aligned to the FASTA-seq using Biopython’s Align module with customized parameters (Supplementary Table 1). By aligning the PDB-seq to the FASTA-seq, we determined the positions of the first and last aligned residues. A gap-filled sequence (Filled-seq), defined as the fragment of the FASTA-seq spanning from the first to the last aligned position, was generated and subjected to a second AbRSA analysis to examine the presence of an Ab chain within the structure.
For PDB entries with multiple models (such as NMR structure ensembles), each model was extracted and processed separately. This approach also applies to the following steps, ensuring that HCs and LCs from different models within the same entry are not incorrectly paired (see below).
The rationale for using Filled-seq in AbRSA type labeling is that it more closely reflects the actual structure compared to FASTA-seq while incorporating missing residues to ensure sequence continuity compared to PDB-seq. If a Filled-seq was identified as HC or LC, it was further analyzed using Abalign [22] to determine its V gene subgroup. However, sequences labeled as HLC were excluded from subgroup analysis, as their hybrid nature—either as a scFv from the same species or a chimera of VH and VL domains from different species—could lead to inaccurate predictions.
Some structures, despite containing Ab chains, lack typical Ab structural domains. For instance, AL amyloid fibrils result from the misfolding of Ab LCs (e.g., PDB ID: 7nsl [23]). To exclude such cases, TM-align [24] was used to compare these structures against high-resolution, well-structured reference domains—VH (PDB ID: 6cnw [25], chain A, X-ray resolution: 0.92 Å) or VL (PDB ID: 4unu [26], chain A, X-ray resolution: 0.95 Å)—based on their AbRSA classification as HC or LC. Ab chains with a TM-score [27] (normalized to the reference) below 0.4 were discarded. Notably, all LCs in 7nsl exhibited TM-scores below 0.2 when compared to 4unu, confirming their lack of Ab domains.
Pairing HCs, LCs, and HLCs
Once AbRSA types were assigned, Ab chains were paired based on their level of interaction. However, structural limitations such as extensive missing residues/atoms or models containing only Cα atoms could hinder accurate interaction calculations. To address this, we used Pulchra [28] to reconstruct full-atom protein models from the Cα traces or incomplete structures. During reconstruction, nonstandard amino acids and residues lacking Cα atoms were excluded, as Pulchra cannot process them correctly. The repaired Ab chain structures were then analyzed using AbRSA_PDB [20] to identify CDR residues. Finally, all the repaired chains were merged into a single complex structure and subjected to side-chain repacking using FASPR [29] to minimize clashes.
Ab chains were then paired both within and between entities using UniDesign [30, 31] to compute the number of interface residues between potential pairing chains (\({N}_{HL\_inf\_res}\)). A residue was considered part of the interface if at least one of its nonhydrogen atoms was within 5 Å of a nonhydrogen atom in the other chain [32, 33]. Typically, a VH-containing chain was paired with a VL-containing chain. Pairing HLCs was more complex, as these chains contain both VH and VL and could pair with HCs, LCs, or other HLCs.
To ensure accurate pairings, a variety of minimum \({N}_{HL\_inf\_res}\) cutoffs were applied, based on chain lengths (Supplementary Table 2). For example, two scFv chains (both are HLCs) were considered a valid pair if their \({N}_{HL\_inf\_res}\) was at least 80, whereas a VH-VL pairing (HC and LC) required only ≥20 interface residues. This filtering step could help to exclude chain pairs that were in proximity but did not form a true functional interface.
Under these criteria, one chain may still structurally pair with multiple partner chains. However, each chain should ideally pair with only one other chain or remain unpaired. To resolve cases where an HC pairs with multiple LCs or vice versa, we employed a combination of greedy search and iterative heuristic search, guided by a carefully designed score function, to maximize the number of valid HL pairs within the entire structure.
In the greedy search procedure, the initial HC-LC pairings were determined using the following steps: (1) Each unpaired Ab chain was temporarily paired with a dummy chain to form a placeholder HL pair. (2) All HL pairs were scored (see the following subsection) and ranked from the highest to lowest based on their pairing scores. (3) A greedy HC-LC pairing solution was constructed by iteratively selecting the highest-scoring HL pairs, ensuring that each chain was only used once. This process continued until all real Ab chains were covered. The total greedy score was then computed by summing the pairing scores of the selected HL pairs. (4) Unused HL pairs were pooled for further evaluation.
The initial pairing solution by the greedy search may not be optimal. To further refine the pairings, an iterative heuristic search was performed as follows: (1) Starting from the greedy pairing solution, the highest-scoring HL pairs were progressively replaced with unused HL pairs to construct a new HC-LC pairing solution while ensuring that each chain was only used once. (2) A new total pairing score was calculated for the updated solution. (3) The substituted HL pairs were returned to the unused pool. (4) This process was repeated until replacing pairs no longer resulted in a higher total score.
This combined strategy ensured a more accurate and stable HC-LC pairing assignment than the greedy search alone, resolving ambiguities and optimizing the structural pairing of Ab chains.
Score function for HC-LC pairing
Each HL pair was scored using the following scoring function:
where RHL and PHL represent the reward and penalty terms for HC-LC pairing, respectively. These terms are defined as follows:
where \({N}_{HL\_inf\_res}\) denotes the number of interface residues between HC and LC.
where \(M{I}_{H\_inf\_res}\) and \(M{I}_{L\_inf\_res}\) represent the mean indices of interface residues on HC and LC, respectively. The averaging factor of \(\frac{1}{2}\) accounts for interface residues on both the HC and LC, while the floor function \(\left\lfloor \cdot\right\rfloor\) ensures an integer output (e.g., an averaged residue index). The scaling factor of 3 in \({R}_{HL}\) balances the reward and penalty term.
Calculating Ab chain mean radius
The mean radius of an Ab chain was calculated using the following equation:
where \(n\) is the number of heavy atoms in the chain, \({x}^{i}\) represents the coordinates of atom \(i\), and \(\bar{x}\) denotes the averaged coordinates of all heavy atoms.
Determining Ab chain types
The Ab chain type classification was determined based on multiple factors, including the AbRSA type, molecular name, species name, mean radius, and the lengths of PDB-seq, Filled-seq, and FASTA-seq. Briefly, an HC was classified into one of the following types: FabH, VH, VH+ (an HC longer than a typical VH), VH- (an HC shorter than a typical VH), VHH, VHH+, and VHH-. In contrast, an LC was categorized as FabL, VL, VL+, or VL-. LCs do not include VHH, VHH+, or VHH- types, as these are exclusive to HC-only Abs. An HLC was classified as scFv, scFv+, or scFv- if its molecular name contained keywords such as “single,” “scFv,” or “Fv” and its mean radius was ≤20 Å. Otherwise, it was categorized as VHVL (a VH-VL fusion of typical scFv size but not explicitly identified as scFv), VHVL+, or VHVL-. Supplementary Table 2 provides the sequence length cutoffs used for Ab type determination.
Pairing Ab and Ag chains
Each Ab was systematically paired with each Ag chain using the same approach as for HC-LC pairing. Specifically, a potential Ab-Ag complex was constructed, and UniDesign [30, 31] was used to identify its interface residues.
An Ag chain was classified as a true Ag if it met all three of the following conditions: (1) the number of Ab-Ag interface residues (\({N}_{ab\_ag\_inf\_res}\)) was ≥10; (2) the number of CDR residues at the interface (\({N}_{CDR\_inf\_res}\)) was ≥5; and (3) the ratio of the number of interface CDR residues to the number of interface Ab residues (\({R}_{CDR\_inf\_res}\)) was ≥0.25. All Ag chains that met these criteria were grouped together, meaning that each Ab could interact with one or multiple Ag chains.
Ab-Ag binding affinity data
To ensure accurate integration of experimentally determined Ab-Ag binding affinity data into SAAINT-DB, each PDB entry was carefully cross-referenced with the PDB-associated publication and related references. Binding affinity data, presented as the dissociation constant (KD) in units of nM, was reported when available. Overall, various techniques were used to determine binding affinity, including surface plasmon resonance (SPR), biolayer interferometry (BLI), isothermal titration calorimetry (ITC), enzyme-linked immunosorbent assay (ELISA), kinetic exclusion assay (KinExA), solution equilibrium titration (SET), flow cytometry, and interferometry. In the current version, when different methods were used for an AAI, ITC data were prioritized for its accuracy, and the geometric mean was used if multiple affinity values were obtained using the same technique (e.g., SPR). We plan to publish all available affinity values without prioritizing a specific one in future development. The temperature of the affinity measurement was recorded if explicitly stated, standardized to 298 K for room temperature, or marked as “N.A.” if unspecified.
Recording Ab and AAI information
For each PDB entry identified as containing an AAI or Ab, essential information was extracted and recorded, encompassing a total of 43 attributes related to the Ab, Ag, and PDB (Supplementary Table 3). For unpaired HCs without an LC partner such as scFv, LC-related fields were assigned either “N.A.” or “0,” depending on the data type. Similarly, for unpaired LCs such as VL, HC-related fields were marked as “N.A.” or “0.” If an Ab was not paired with an Ag, the Ag-related fields were also set to “N.A.” or “0.” All other unavailable fields were recorded as “N.A.”
Building and updating SAAINT-DB
To construct SAAINT-DB for the first time, all mmCIF files from the PDB were downloaded to a local cluster using remote synchronization (rsync), while the associated FASTA files were retrieved via wget. The PDB entries were distributed across hundreds of CPUs, with each CPU processing <1000 entries using SAAINT-parser. Based on our tests, the initial processing on 300 CPUs took approximately ten hours to complete. Once processed, the processed Ab and AAI data entries for individual PDB entries were merged to generate the SAAINT-DB.
To update SAAINT-DB, rsync was used to detect removed, modified, and newly added mmCIF files, and the corresponding FASTA files were updated accordingly. The Ab and AAI data entries were removed for deleted PDB entries, re-computed for modified ones, and newly computed for newly added entries. Only these updated PDB entries were assigned to a smaller number of CPUs, each running SAAINT-parser to process a few entries. Our tests showed that the update process typically took less than two hours, depending on the number of updated PDB entries. Finally, the processed Ab and AAI data entries were merged to update the SAAINT-DB.
Since Ab-Ag binding affinity data must be manually collected and curated from the literature, it lagged behind the SAAINT-DB structure data and was maintained separately.
Results
The SAAINT-parser workflow
The SAAINT-parser workflow extracts paired Abs, unpaired Ab chains, and AAIs from experimentally solved structures using their PDB IDs as input (Fig. 1a). It consists of three main modules that process the PDB-associated FASTA file, mmCIF file, and web content, respectively. These modules generate intermediate data, which are then integrated to infer the outcomes. Below is a description of the key components, with further details provided in the Methods section.
The FASTA module processes chain entities and sequences from the FASTA file, utilizing AbRSA [20] to determine whether a FASTA-seq contains Ab variable domains and, if possible, to identify Ab residues (i.e., FASTA Ab residues). The mmCIF module employs Biopython [21] to parse the complete structural model from the mmCIF file. PDB-seqs consisting of residues with Cα atomic coordinates are extracted from the structure, aligned to the corresponding FASTA-seq, and used to generate the Filled-seq, which then undergoes a second AbRSA analysis. The entire structure is further divided into individual chains, each reindexed based on the alignment for Filled-seq creation and repaired using Pulchra [28] to restore missing atoms. Each repaired chain is analyzed to identify protein residues (i.e., PDB residues) with atomic coordinates and subjected to AbRSA_PDB [20] to detect CDR residues if classified as an Ab chain. The repaired individual chains are then concatenated and optimized using FASPR [29], which repacks protein side chains to minimize steric clashes. Additionally, the web module retrieves key PDB information by directly accessing the PDB webpage.
During the integration phase, AbRSA types inferred from FASTA-seqs and Filled-seqs are combined with data such as FASTA Ab residues and PDB residues to determine the final AbRSA types. Dependent on chain entities and final AbRSA types, all possible HC-LC pairs (HL pairs) are extracted from the FASPR model and analyzed with UniDesign [30, 31], yielding a set of relatively reliable HL pairs, which are further optimized by a combination of greedy search and iterative heuristic search to ensure that one HC is paired with at most one other LC or vice versa (see Methods). The final HL pairs or unpaired HCs/LCs, along with the web-fetched PDB title, are then used to determine Ab types.
The AAI identification process follows a similar approach to identifying HL pairs. For each Ab, all possible Ab-Ag pairs are extracted and analyzed by UniDesign. An Ab-Ag pair is considered valid if its interface contains a relatively higher number of residues (e.g., ≥10) and a relatively higher proportion of CDR residues (e.g., ≥0.25). To ensure comprehensive representation, all Ag chains associated with a given Ab are merged. All identified AAIs and/or Abs are recorded in SAAINT-DB.
Finally, besides the data generated by SAAINT-parser, manually curated Ab-Ag binding affinity data is also incorporated for AAIs based on relevant publications.
Statistics and analysis of SAAINT-DB
The SAAINT-DB comprises 19,128 data entries derived from 9757 PDB entries released between May 19, 1976 and April 30, 2025 (deposited between March 17, 1976 and March 31, 2025) (Fig. 1b). Over time, the number of experimentally determined Ab-involved PDB entries has steadily increased each year, with a notable increase between 2020 and 2023, likely driven by the global COVID-19 pandemic related research.
The dataset is predominantly composed of structures determined by X-ray diffraction (6261 structures, 64.2%) and electron microscopy (EM) (3469 structures, 35.5%), while other experimental methods account for only 27 structures (0.3%) (Supplementary Fig. 1a). X-ray structures exhibit resolutions ranging from 0.92 to 8.0 Å with a median resolution of 2.4 Å (Supplementary Fig. 1b). Among them, 3605 (57.6%) have a moderate resolution of ≤2.5 Å, indicating moderate or better quality. The R-free and R-work values range from 0.122 to 0.458 (median: 0.249) and from 0.102 to 0.441 (median: 0.205), respectively (Supplementary Fig. 1c). EM structures have resolutions spanning 1.7 to 35.0 Å, with a median of 3.3 Å (Supplementary Fig. 1b). Of these, 3200 (92.2%) have a resolution of ≤4.5 Å, indicating medium or better quality.
The basic structure of an Ab usually consists of HCs and LCs, each made up of Ig domains including VH (Variable domain of HC) and VL (Variable domain of LC), which contain the CDRs and FRs (Framework Regions) and constant regions (CH and CL). Ab structure is functionally divided into two main regions: Fab (Fragment antigen-binding) and Fc (Fragment crystallizable). These regions are connected by a flexible hinge, allowing the Fab arms to move and bind to Ags at various angles. Fv, the smallest functional unit of an Ab responsible for Ab binding and specificity, consists of VH and VL.
The HCs and LCs of paired Abs may originate from different sources. Additionally, some Ab entries consist of a single chain, requiring separate source analysis for HCs and LCs. The top five species sources of HCs are Homo sapiens, Mus musculus, Lama glama, Vicugna pacos, and synthetic constructs (Supplementary Fig. 2a). For LCs, the top sources are Homo sapiens, Mus musculus, Macaca mulatta, Oryctolagus cuniculus, and synthetic constructs (Supplementary Fig. 2b).
As with any structural Ab database, Ab structures form the foundation of SAAINT-DB. However, the Ab structures in the PDB to a large extent lack a unified Ab format. Fully solved Abs, consisting of all Ab domains, are rare; instead, Fab and Fv are the most common formats. Another widely studied format is VHH (Variable Heavy domain of Heavy chain only Ab), also known as a nanobody. Additionally, various engineered Ab formats exist, such as scFv (single-chain Fv). Due to experimental limitations, some Ab structures contain missing residues or domains, leading to discrepancies between deposited structures and their annotations. For example, an Ab is annotated as a Fab, whereas its atomic coordinates only include Fv domains. In this situation, it would be more appropriate to classify it as an Fv rather than a Fab.
To address this, SAAINT-parser implements a robust method for automatically categorizing Ab entries into the most appropriate types, such as Fab, Fv, VHH, scFv, and others, by considering sequence data, structural features, and PDB annotations (see Methods). To explicitly denote paired chains, we introduce chain-specific Ab types for both HCs and LCs. For instance, instead of simply labeling an Ab as Fab, we specify it as FabH:FabL. Similarly, we use VH:VL instead of Fv to represent paired variable domains. Overall, SAAINT-DB defines 29 Ab types; among them, the most common are FabH:FabL, VH:VL, VHH, and scFv, with respective counts of 9377, 3643, 3283, and 1377 entries (Fig. 2a). This aligns with the known prevalence of Fab, Fv, VHH, and scFv in solved Ab structures.
Protein chain length plays a critical role in classifying Ab types. However, determining Ab type solely from its FASTA-seq can be unreliable, as these sequences may include a large number of non-Ab residues. To improve classification accuracy, we analyze lengths from multiple sequence representations: FASTA-seq, PDB-seq, and Filled-seq, with a particular focus on the latter two. It is well established that VH, VHH, and VL domains are single-domain structures of approximately 110 amino acids (aa), whereas FabH and FabL include both a variable and constant domain, making them roughly twice as long (~220 aa). As shown, many VH FASTA-seqs exceed 400 aa, whereas their PDB-seqs and Filled-seqs remain ~110 aa (Fig. 2b). Similarly, while many FabH FASTA-seqs surpass 400 aa, their PDB-seqs and Filled-seqs remain ~220 aa. Additionally, VH-, VL-, and scFv- (representing incomplete VH, VL, or scFv structures) are generally shorter than their parental types (Fig. 2b, c). In contrast, VH+, VHH+, VL+, and VHVL+ exhibit broader length distributions (Fig. 2b, c), as they usually represent chimeric constructs combining Ab domains with other protein regions.
For a given Ab type, the number of interface residues in HL pairs (\({N}_{HL\_inf\_res}\)) can vary substantially (Supplementary Fig. 2c), reflecting differences in pairing angles [34] and packing densities between HCs and LCs. For example, this number ranges from as few as 20 aa to as many as 80 aa in VH:VL types. Both VHVL and scFv consist of a VH and a VL domain, and their PDB-seqs cluster ~220 aa (Fig. 2b, c), making it difficult to distinguish them by sequences. However, a key structural difference is that scFv’s VH and VL domains pack more tightly, resulting in a smaller mean radius (see Methods). We found that a cutoff of 20 Å can effectively differentiate scFvs from VHVLs with perfect accuracy (Supplementary Fig. 2d).
Among the 19,128 entries in SAAINT-DB, 14,316 (74.8%) are classified as AAIs, where each Ab interacts with one or more Ag chains, including proteins, peptides, DNAs, and RNAs (Supplementary Fig. 3a, b). Regarding Ag sources, the most common species are Homo sapiens, SARS-CoV-2, HIV-1, influenza A, and Plasmodium falciparum (Supplementary Fig. 3c).
AAIs exhibit substantial diversity, with the number of residues at the Ab-Ag interfaces (\({N}_{ab\_ag\_inf\_res}\)) ranging from 10 to 185, despite most falling between 30 and 60 (Fig. 3a). A key structural feature of AAIs is that Abs primarily engage Ags through their CDR residues. Accordingly, a large portion of CDR residues is present at the Ab-Ag interfaces, with counts (i.e., \({N}_{CDR\_inf\_res}\) values) ranging from 5 to 40 (Fig. 3b), reflecting varied binding characteristics. The proportion of interfacial CDR residues to total interfacial Ab residues (\({R}_{CDR\_inf\_res}\)) spans from 25% to 100%, with most exceeding 70% (Fig. 3c), consistent with the knowledge that most interfacial Ab residues reside within the CDR regions.
a Histogram of the number of residues at antibody-antigen interfaces; the inset shows a zoomed-in view for numbers exceeding 100. b Histogram of the number of CDR residues at antibody-antigen interfaces; the inset shows numbers exceeding 30. c Histogram of the ratio of interfacial CDR residues to the total number of interface residues. The inset shows ratios below 0.4. d Histogram of pKD values. The insets show zoomed-in views for values below 6 and above 11.
In developing Ab therapeutics, a critical step is to optimize an Ab’s affinity for its target. Therefore, integrating Ab-Ag binding affinity data into SAAINT-DB is essential. For each PDB entry, we carefully reviewed its associated and relevant publications to extract the binding affinity data, if available. However, manually checking literature is laborious and very time-consuming. As of this writing, we have reviewed 2866 out of the 9757 PDB entries, 1331 of which are associated with binding affinity data, accounting for 1444 non-redundant data entries in SAAINT-DB. The binding affinity, in terms of dissociation constant (KD), varies in a wide range from high micromolar to sub picomolar, with most in the nanomolar range, with pKD values ranging from 4 to 14 with a median of ~8.5 (Fig. 3d). Surface plasmon resonance (SPR) and biolayer interferometry (BLI) are most widely used for affinity measurement (Supplementary Fig. 3d).
Classification of Ab types in SAAINT-DB
A major advancement of SAAINT-DB over existing structural Ab databases is its detailed classification of Ab types, making the assessment of classification accuracy a critical component.
Figure 4 presents structural examples for five selected Ab types in SAAINT-DB, ilustrating a high degree of consistency between algorithmic classification and structural visualization. For instance, Fab 1nby consists of an HC and LC with VH-CH1 and VL-CL domains, respectively (Fig. 4a). Fv 1a14 features a VH:VL pair (Fig. 4b), while VHH 1jtt consists of a single VH domain from Camelus dromedarius (Fig. 4c). The scFv 1dzb represents a fused VH-VL construct (Fig. 4d). The VH+:FabL category is particularly complex, encompassing full-length Abs, B-cell receptors, Fabs with ultralong CDR-H3, IgEs with Cε2, and Fabs with domain insertions in the HC (Fig. 4e).
SAAINT-DB can also accurately assign Ab types even when Ab structures have numerous missing residues (Supplementary Fig. 4). For example, in VH- 6pw6, only two CDR fragments are present in the HC, while scFv- 6kn9 shows a large number of missing residues. In VH-:VL- 7ujd, both chains retain only three CDR loops. VHH- 8jys has only two CDR fragments. In VH-:VL 7dk7, the VL is intact, but the VH contains just one CDR fragment. Similarly, in VH:VL- 8vzo, the VH is complete, but the VL is notably incomplete.
Interestingly, certain types, such as VH, VL, and FabL, can exist either as monomers or homodimers (Supplementary Fig. 5). Notably, FabL dimers, also known as Bence-Jones proteins, consist of Ab LCs and are produced by abnormal plasma cells. These proteins are clinically significant as they are often associated with conditions like multiple myeloma and other plasma cell dyscrasias, indicating monoclonal gammopathy. However, to prioritize HC-LC pairings and discourage LC-LC and HC-HC pairings, SAAINT-parser splits these homodimeric pairings into monomers. Additionally, SAAINT-DB consistently and reliably categorizes Ab types in other cases as well (Supplementary Fig. 5).
Comparison with existing databases
The IMGT/3Dstructure-DB, last updated on May 23, 2024, contains 9053 data entries and 7243 PDB entries. AbDb, with its most recent update on July 26, 2019, includes 5976 complete Ab Fvs and unpaired HCs or LCs derived from 3348 PDB entries. BEID and AgAbDb are currently inaccessible.
SAbDab, last updated on May 2, 2025, has 18,744 data entries (representing Fv regions) from 9521 PDB entries. Of these, 7622 PDB entries contain at least one paired VH/VL, and 7788 include Ags. Additionally, SAbDab provides 736 manually curated Ab-Ag affinity data points (Fig. 5a).
a Statisics of SAAINT-DB and SAbDab. Note that only proteins, peptides, and nucleic acids are considered potential antigens in SAAINT-DB. b Venn diagram illustrating the difference in the number of PDB entries between SAAINT-DB and SAbDab. c Two examples of typical Ab structures present in SAAINT-DB but absent from SAbDab. d SAAINT-parser failure for PDB entry 7o85 due to incorrect FASTA sequences. e SAAINT-parser failure for 2h3n caused by missing C-terminal residues. In contrast, its companion entry 2h32, which lacks missing residues, was successfully processed. The C-terminal residues absent from 2h3n but present in 2h32 are highlighted in magenta. Protein structures are visualized using PyMOL.
In comparison, SAAINT-DB, updated on May 1, 2025, comprises 19,128 data entries from 9757 PDB entries. It includes 7822 structures with at least one paired VH/VL, and 7489 with Ags (Fig. 5a). Of the 2866 out of 9757 PDB entries reviewed, 1444 non-redundant Ab-Ag binding affinity data points were collected, associated with 1331 PDB structures.
We then investigated the difference of these entries in greater details. Our analysis revealed that SAAINT-DB includes 242 PDB entries absent from SAbDab (Fig. 5b; Supplementary Table 4), such as the VHH A52 from Lama glama (PDB ID: 1i3u) and the anti-ciguatoxin Fab 10C9 (PDB ID: 2z93) (Fig. 5c). Some of these Ab structures are excluded from SAbDab, partly due to the application of stricter structural filters. Conversely, SAbDab contains six PDB entries not present in SAAINT-DB (Fig. 5b). Four of them (2f5a, 8xn9, 8xnh, and 8zfj) are obsolete and replaced by successor entries. For the remaining two entries, 7o85 lacks correctly assigned Ab FASTA-seqs despite being annotated as containing a Fab by the deposition authors (Fig. 5d), while 2h3n, the structure of a surrogate LC homodimer, could not be correctly handled by AbRSA likely due to a large number of missing residues at the C-terminus, as its companion entry 2h32 (no missing residues) was successfully processed (Fig. 5e). Additionally, SAbDab failed to pair Abs in 60 PDB entries which were faithfully paired in SAAINT-DB (Supplementary Table 5).
Discussion
While several specialized structural Ab databases exist, they still exhibit limitations in data accuracy, completeness, and/or update frequency. This study introduces a comprehensive and accurate database—SAAINT-DB—designed to complement or potentially improve over existing resources. To achieve this, we developed SAAINT-parser, the first open-source workflow for fast and accurate parsing of PDB entries to extract structural Abs and AAIs. A detailed comparison with SAbDab and other databases highlights the advantages of SAAINT-DB in both accuracy and completeness. To aid readers, we summarize methodological similarities and key differences among SAAINT-DB, SAbDab, and AbDb in Supplementary Table 6.
Accurate pairing of HCs and LCs is a crucial step in SAbDab, AbDb, and SAAINT-DB. This process relies not only on the precise identification of Ab chains but also on correctly matching corresponding HCs and LCs. We used AbRSA to identify Ab chains and CDR regions, leveraging its high accuracy and speed [20]. Existing databases rely on different strategies for HC-LC pairing. SAbDab enforces a constraint requiring the conserved cysteine at Chothia position 92 on an HC to be within 22 Å of the conserved cysteine at position 88 on an LC [15]. While effective in most cases, this criterion is overly restrictive for certain structures and fails when one or both of these cysteines are missing, which may partly explain why SAbDab sometimes identifies Ab chains that should be paired as unpaired (Supplementary Table 5). AbDb, on the other hand, maximizes the number of atomic contacts within 4 Å to determine pairings [19]. While this approach works well for single-domain structures (e.g., VHs and VLs) in building AbDb, it can lead to incorrect pairings in complex cases such as domain-swapped Fabs.
To overcome these challenges, SAAINT-parser implements a robust HC-LC pairing procedure. First, we defined thresholds for \({N}_{HL\_inf\_res}\) to filter out false pairings while improving computational efficiency. Second, we prioritized high-confidence pairings using a composite scoring function that combines \({N}_{HL\_inf\_res}\) (similar to AbDb’s contact-based approach) with the mean index of interface residues. Finally, we employed a hybrid approach combining greedy search and iterative heuristic search to maximize the number of valid HL pairs across the entire structure.
This method ensures accurate pairings even in very complicated cases. For example, in PDB entry 8d01, which contains two HCs (A and H) and two LCs (B and L), the number of interface residues for pairings AB, AL, HB, and HL are 44, 50, 52, and 48, respectively. AbDb’s rule predicts AL and HB as the correct pairings, whereas the actual pairings are AB and HL (Fig. 6a). A more complex example is PDB entry 2oqj, which consists of four HCs (B, E, H, and K), four LCs (A, D, G, and J), and four Ag chains (C, F, I, and L). The number of interface residues for HC-LC pairings BA, EA, BD, ED, HG, KG, HJ, and KJ are 51, 60, 56, 50, 48, 57, 55, and 50, respectively. Under AbDb’s rule, the predicted pairings would be EA, BD, KG, and HJ, while the correct pairings are BA, ED, HG, and KJ (Fig. 6b).
Another advantage of SAAINT-DB is its detailed classification of Ab types based on both sequence information and structural integrity. SAAINT-DB also highlights several of the most adundant Ab types, such as FabH:FabL, VH:VL, VHH, and scFv (Fig. 2). In contrast, AbDb focuses solely on VH/VL domains, which can be limiting for researchers interested in Fabs. In its database summary file, SAbDab provides only a coarse annotation indicating whether an Ab is an scFv, without detailed classification of Ab types. As a result, users need to manually inspect the PDB structure to determine whether an Ab is a Fab, Fv, or another type. For advanced users aiming to collect structural data for specific Ab types in modeling and design tasks, this can require significant extra effort. In contrast, SAAINT-DB’s comprehensive Ab type classification streamlines this process.
In addition, regular updates are essential for any database. SAbDab is well updated with the latest update on May 2, 2025. In contrast, IMGT/3Dstructure-DB and AbDb are updated less frequently, with their most recent updates on May 23, 2024 and July 26, 2019, respectively. Our tests showed that scanning the entire PDB with 300 CPUs (3.0 GHz Intel Xeon Gold 6154) took only a few hours to build the initial SAAINT-DB and even less time for updates (see Methods). Therefore, we conclude that SAAINT-DB could be consistently kept up to date with the PDB.
Despite its advantages, SAAINT-parser and SAAINT-DB have certain limitations. First, SAAINT-parser relies on AbRSA for Ab chain labeling, making its accuracy dependent on AbRSA’s precision. While AbRSA is highly accurate, it can occasionally fail. For example, in PDB entry 8d53, AbRSA failed to identify the HC of 35022scFv, likely due to a missing nine-residue fragment in its FASTA-seq. However, AbRSA and SAAINT-parser successfully identified other Ab chains, including Fab PGT124. Notably, this entry is completely absent from SAbDab. Second, classifying engineered or unusually long Ab chains remains somewhat ambiguous. For instance, VH+ indicates the presence of a VH domain but does not specify the exact number of VH domains or provide structural details of non-VH regions. Similarly, VH+ :FabL encompasses diverse Ab structural configurations (Fig. 4h). Third, SAAINT-parser and SAAINT-DB currently support only protein, peptide, RNA, and DNA Ags, limiting their applicability to other Ag types such as carbohydrates and haptens (nonpolymeric ligands). Addressing these limitations is a priority for future development. Additionally, due to limited resources, we do not currently provide a dedicated web server for hosting SAAINT-DB. As a result, the database may be less accessible and user-friendly compared to platforms such as SAbDab and AbDab, which offer intuitive graphical interfaces, data visualization tools, and well-documented application programming interface for programmatic access. While SAAINT-DB emphasizes data accuracy, completeness, and detailed Ab type classification, we acknowledge that its lack of a web-based interface and interactive tools may pose a barrier for some users, particularly those without computational expertise. Future development of a web-accessible platform—with filtering options, visualization features, and integration with other immunoinformatic tools—would further enhance the usability and impact of SAAINT-DB.
In summary, this study introduces SAAINT-parser, an advanced tool for the effective and efficient processing of PDB structures to extract structural Ab information. By applying this tool to the PDB, we have constructed a new, comprehensive structural Ab database. Detailed analyses and comparisons with existing databases highlight the strengths of SAAINT-DB in terms of data completeness, accuracy, and update frequency.
Data availability
The SAAINT-DB summary file and binding affinity data (in spreadsheets), unprocessed PDB structures, and SAAINT-parser-processed structure models are available at https://github.com/tommyhuangthu/SAAINT. The complete PDB repository can be downloaded from https://files.wwpdb.org/.
Code availability
The SAAINT-parser source code is available at https://github.com/tommyhuangthu/SAAINT. It is built on Python 3.10 and Biopython 1.84. FASPR (v.20200309) can be accessed at https://github.com/tommyhuangthu/FASPR. UniDesign (v.1.2) is available at https://github.com/tommyhuangthu/UniDesign. AbRSA (v.2.20) can be downloaded from http://cao.labshare.cn/AbRSA/download.html, while AbRSA_PDB (v.2.00) is available through Dr. Yang Cao at Sichuan University. Abalign (v.2.00) can be downloaded from http://cao.labshare.cn/abalign/. TM-align (v.20220412) is available at https://zhanggroup.org/TM-align. Lastly, Pulchra (v.3.06) can be accessed at https://github.com/euplotes/pulchra.
References
Maynard J, Georgiou G. Antibody engineering. Annu Rev Biomed Eng. 2000;2:339–76.
Sliwkowski MX, Mellman I. Antibody therapeutics in cancer. Science. 2013;341:1192–8.
Berman HM, Battistuz T, Bhat TN, Bluhm WF, Bourne PE, Burkhardt K, et al. The protein data bank. Acta Crystallogr Sect D Biol Crystallogr. 2002;58:899–907.
Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596:583–9.
Evans R, O’Neill M, Pritzel A, Antropova N, Senior A, Green T, et al. Protein complex prediction with AlphaFold-Multimer. bioRxiv. 2021. https://doi.org/10.1101/2021.10.04.463034.
Baek M, DiMaio F, Anishchenko I, Dauparas J, Ovchinnikov S, Lee GR, et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science. 2021. https://doi.org/10.1126/science.abj8754.
Baek M, Anishchenko I, Humphreys IR, Cong Q, Baker D, DiMaio F. Efficient and accurate prediction of protein structure using RoseTTAFold2. bioRxiv. 2023. https://doi.org/10.1101/2023.05.24.542179.
Abramson J, Adler J, Dunger J, Evans R, Green T, Pritzel A, et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature. 2024;630:493–500.
Liang T, Jiang C, Yuan J, Othman Y, Xie XQ, Feng Z. Differential performance of RoseTTAFold in antibody modeling. Brief Bioinform. 2022;23:bbac152.
Yin R, Pierce BG. Evaluation of AlphaFold antibody-antigen modeling with implications for improving predictive accuracy. Protein Sci. 2024;33:e4865.
Watson JL, Juergens D, Bennett NR, Trippe BL, Yim J, Eisenach HE, et al. De novo design of protein structure and function with RFdiffusion. Nature. 2023;620:1089–100.
Bennett NR, Watson JL, Ragotte RJ, Borst AJ, See DL, Weidle C, et al. Atomically accurate de novo design of single-domain antibodies. bioRxiv. 2024. https://doi.org/10.1101/2024.03.14.585103.
Kaas Q, Ruiz M, Lefranc MP. IMGT/3Dstructure-DB and IMGT/StructuralQuery, a database and a tool for immunoglobulin, T cell receptor and MHC structural data. Nucleic Acids Res. 2004;32:D208–10.
Tong JC, Song CM, Tan PT, Ren EC, Sinha AA. BEID: database for sequence-structure-function information on antigen-antibody interactions. Bioinformation. 2008;3:58–60.
Dunbar J, Krawczyk K, Leem J, Baker T, Fuchs A, Georges G, et al. SAbDab: the structural antibody database. Nucleic Acids Res. 2014;42:D1140–6.
Kulkarni-Kale U, Raskar-Renuse S, Natekar-Kalantre G, Saxena SA. Antigen–antibody interaction database (AgAbDb): a compendium of antigen–antibody interactions. Immunoinformatics. 2014;1184:149–64.
Adolf-Bryfogle J, Xu Q, North B, Lehmann A, Dunbrack RL Jr. PyIgClassify: a database of antibody CDR structural classifications. Nucleic Acids Res. 2015;43:D432–8.
Sirin S, Apgar JR, Bennett EM, Keating AE. AB-Bind: antibody binding mutational database for computational affinity predictions. Protein Sci. 2016;25:393–409.
Ferdous S, Martin ACR. AbDb: antibody structure database-a database of PDB-derived antibody structures. Database. 2018;2018:bay040.
Li L, Chen S, Miao Z, Liu Y, Liu X, Xiao ZX, et al. AbRSA: A robust tool for antibody numbering. Protein Sci. 2019;28:1524–31.
Cock PJ, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009;25:1422–3.
Zong F, Long C, Hu W, Chen S, Dai W, Xiao ZX, et al. Abalign: a comprehensive multiple sequence alignment platform for B-cell receptor immune repertoires. Nucleic Acids Res. 2023;51:W17–W24.
Radamaker L, Karimi-Farsijani S, Andreotti G, Baur J, Neumann M, Schreiner S, et al. Role of mutations and post-translational modifications in systemic AL amyloidosis studied by cryo-EM. Nat Commun. 2021;12:6434.
Zhang Y. Skolnick J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 2005;33:2302–9.
Laursen NS, Friesen RHE, Zhu X, Jongeneelen M, Blokland S, Vermond J, et al. Universal protection against influenza infection by a multidomain antibody to influenza hemagglutinin. Science. 2018;362:598–602.
Brumshtein B, Esswein SR, Landau M, Ryan CM, Whitelegge JP, Phillips ML, et al. Formation of amyloid fibers by monomeric light chain variable domains. J Biol Chem. 2014;289:27513–25.
Zhang Y, Skolnick J. Scoring function for automated assessment of protein structure template quality. Proteins. 2004;57:702–10.
Rotkiewicz P, Skolnick J. Fast procedure for reconstruction of full-atom protein models from reduced representations. J Comput Chem. 2008;29:1460–5.
Huang X, Pearce R, Zhang Y. FASPR: an open-source tool for fast and accurate protein side-chain packing. Bioinformatics. 2020;36:3758–65.
Huang X, Zhou J, Yang D, Zhang J, Xia X, Chen YE, et al. Decoding CRISPR–Cas PAM recognition with UniDesign. Brief Bioinforma. 2023;24:bbad133.
Huang X, Sun Y, Osawa Y, Chen YE, Zhang H. Computational redesign of cytochrome P450 CYP102A1 for highly stereoselective omeprazole hydroxylation by UniDesign. J Biol Chem. 2023;299:105050.
Huang X, Pearce R, Zhang Y. EvoEF2: accurate and fast energy function for computational protein design. Bioinformatics. 2020;36:1135–42.
Huang X, Zheng W, Pearce R, Zhang Y. SSIPe: accurately estimating protein-protein binding affinity change upon mutations using evolutionary profiles in combination with an optimized physical energy function. Bioinformatics. 2020;36:2429–37.
Dunbar J, Fuchs A, Shi J, Deane CM. ABangle: characterising the VH-VL orientation in antibodies. Protein Eng Des Sel. 2013;26:611–20.
Acknowledgements
This work was supported by the Universitity of Michigan Medical School internal funds to the Center for Advanced Models for Translational Sciences and Therapeutics (CAMTraST). We thank the Advanced Research Computing (ARC) at the University of Michigan for providing the computational resources and services that supported this research.
Author information
Authors and Affiliations
Contributions
XH conceived and conducted the project. XH, JZ, SC, XX, YEC, and JX conducted validation of the database. XH wrote the paper. XH, YEC, and JX edited the paper. All authors participated in discussions and agreed with the final paper.
Corresponding author
Ethics declarations
Competing interests
XX, YEC, and JX are equity holders of ATGC Inc. XX is an employee of ATGC Inc. ATGC Inc. holds multiple technology licenses from the University of Michigan. The other authors declare that they have no conflict of interest.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Huang, X., Zhou, J., Chen, S. et al. SAAINT-DB: a comprehensive structural antibody database for antibody modeling and design. Acta Pharmacol Sin (2025). https://doi.org/10.1038/s41401-025-01608-5
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41401-025-01608-5