Main

Single-nucleotide variants (SNVs) are the most abundant class of genetic variants segregating in the human population and are at the same time easy to access using microarray or short-read sequencing platforms. Unsurprisingly, virtually all genome-wide association studies (GWAS) seeking to map genotypes to phenotypes have been focusing on SNVs. In contrast, structural variants (SVs), which are 50 bp in size or longer, are much more challenging to characterize; more than half of all SVs per sample are missed by short-read-based variant discovery1,2,3, despite their biomedical relevance4,5. Almost 750 genes contain ‘dark’ protein-coding exons, where read mapping and variant calling cannot be adequately performed6; around 400 medically relevant genes are almost inaccessible because of their repetitive nature and high polymorphic complexity7. Of them, 273 genes are widely used for variant calling and assembly benchmarking8,9. Long-read technologies are needed to address this problem10,11,12 and recent long-read-based genome assembly strategies indeed led to haplotype-resolved genome assemblies of diploid samples that routinely resolve many previously intractable complex genetic loci13,14. Nevertheless, long-read sequencing of large cohorts remains prohibitively expensive, signifying the need for accurate short-read-based genotyping.

In the meantime, high-quality assemblies are available for hundreds of human haplotypes and give rise to a pangenome reference2,8,15. The genetic variation encoded therein can serve as a basis for genotyping workflows by mapping reads to a pangenome graph16,17 or through k-mer-based genome inference18. While genome inference with Pangenie18 has expanded the set of accessible SVs considerably8, it exhibits limitations at complex loci with few unique k-mers. As an alternative strategy, methods for targeted genotyping of genes of special interest, such as the HLA, KIR and CYP2 gene families, have been developed19,20,21,22,23,24.

In this study, we propose a new tool, called Locityper, to leverage genome assemblies in a pangenome reference or custom collection of locus alleles for fast targeted genotyping of complex loci. Locityper is a general-purpose genotyper that can efficiently process both short-read and long-read data; it integrates a range of different signals based on read depth, alignment identity and paired-end distance in a statistical model to infer genotype likelihoods. This provides an opportunity to genotype and analyze a diverse set of previously understudied genes for already available large sequencing datasets, such as the 1000 Genomes Project cohort and large biobanks like the All-of-Us25 program and the UK Biobank (UKB)26, where disease association studies can be performed.

Results

Overview of the method

Locityper is a targeted genotyping tool designed for structurally variable polymorphic loci. For every target region, Locityper finds a pair of haplotypes (locus genotype) that explain the input whole-genome sequencing (WGS) dataset in a most probable way. Locus genotyping depends solely on the reference panel of haplotypes, which can be automatically extracted from a variant call set representing a pangenome, or provided as an input set of sequences. Before genotyping, Locityper efficiently preprocesses the WGS dataset and probabilistically describes read depth, insert size and sequencing error profiles. Next, Locityper uses minimizers to recruit reads to all target loci simultaneously.

At each locus, Locityper estimates a likelihood for every possible locus genotype by distributing recruited reads across possible alignment locations at the corresponding haplotypes (Fig. 1). The likelihood function is defined in such a way to prioritize read assignments with a smaller number of sequencing errors; plausible insert sizes across the read pairs; and stable read depth without excessive dips or rises. We show that finding a maximum likelihood read assignment can be formulated as an integer linear programming (ILP) problem or identified through stochastic optimization (Methods). Finally, Locityper identifies a genotype with the highest joint likelihood and outputs the most probable read alignments to the two corresponding haplotypes.

Fig. 1: Illustration of the locus genotyping approach.
figure 1

a, Reference panel of four locus haplotypes (AD). b, WGS reads, recruited to any of the haplotypes. For illustrative purposes, haplotypes and reads are colored using homologous blocks (information, unavailable to Locityper). c, Optimal assignments of reads to various genotypes, where the small red squares show read alignment mismatches or indels. Genotype A,B has the highest joint likelihood because of a small number of alignment errors and no lack or excess of read depth.

Locityper accurately genotypes challenging loci

To evaluate Locityper’s targeted genotyping accuracy, we used a reference panel of 90 haplotypes from phased whole-genome assemblies8 across 256 target loci (Methods) covering 13.9 Mb and fully encompassing 265 challenging medically relevant (CMR) genes7 and 23 other protein-coding genes (Supplementary Table 1).

To measure the haplotyping error, we calculated sequence divergence between actual and predicted haplotypes (Fig. 2a) and corresponding Phred-like27 quality values (QVs), which are widely used for genome assembly evaluation28. Then, we distributed haplotype predictions into five bins based on their QV (<17, 17–23, 23–33, 33-43 and ≥43), where a haplotype from the last two bins (QV ≥ 33) differs from an actual haplotype by no more than 5 bp per 10 kb (Fig. 2b), which is competitive with long-read genome assemblies from Oxford Nanopore Technologies (ONT) data29. Note that the haplotypes were compared across the whole locus, including both coding and noncoding regions, which avoids the need for gene annotations on highly variable haplotypes.

Fig. 2: Haplotype accuracy definition and analysis at 256 CMR loci.
figure 2

a, The haplotyping error was calculated as the sequence divergence between actual and predicted haplotypes. The QV is a Phred-like transformation of the haplotyping error. b, Approximate correspondence between haplotyping error and QV bins. cg, Fraction of haplotypes across 256 loci and multiple samples, distributed into five QV bins (Supplementary Table 2). The median QV is shown above each bar. c, Locityper accuracy in the LOO configuration. d, Haplotype availability (QV between actual and closest available LOO haplotypes). e, Haplotyping accuracy of the 1KGP call set, as well as Sniffles and Sniffles + DeepVariant variant calling. f, Concordance of Locityper predictions across 563 unrelated trios. g, Locityper accuracy using the full reference panel. h,i, Correspondence between haplotype availability and haplotyping accuracy based on Illumina (h) and PacBio HiFi (i) WGS datasets. The red lines mark a 0, 5-point and 10-point QV loss.

First, we genotyped 40 Illumina WGS datasets from the Human Pangenome Reference Consortium (HPRC) cohort. Each dataset was processed using the leave-one-out (LOO) approach, where the two relevant sample haplotypes were excluded from the reference panel. Across 20,350 cases where locus haplotypes were fully assembled, Locityper achieved a median QV = 35.27, with 58.8% haplotypes having QV ≥ 33 (15.2% for QV ≥ 43). On the other hand, 9.1% haplotypes had QV = 17–23 and 5.1% haplotypes had QV ≤ 17 (Figs. 2c and 3). Instead of unmapped reads, Locityper can process existing alignments, substantially accelerating the read recruitment stage. This does not lead to lower accuracy; Locityper predictions for ten mapped WGS datasets showed virtually identical results (median QV = 35.25; Supplementary Table 2).

Fig. 3: Locityper haplotyping accuracy using an LOO reference panel for 40 Illumina WGS datasets.
figure 3

Predicted haplotypes across 256 CMR loci were binned into five groups according to their haplotyping QV.

Even though HPRC assemblies are very accurate, they may include assembly or phasing errors, especially at challenging loci. To remove this factor from the performance analysis, we used ART Illumina30 to simulate 44 short-read datasets and processed them with Locityper. As expected, the tool showed higher accuracy on simulated datasets, producing a median QV = 35.65, with 60.7% and 4.0% haplotypes receiving QV ≥ 33 and <17, respectively (Supplementary Fig. 1a).

Locityper is not limited to short reads and can process various long-read WGS datasets, including PacBio HiFi and ONT data. For these technologies, Locityper achieved higher median QVs of 36.90 and 35.95, respectively, and produced 66.6% and 64.5% haplotypes with QV ≥ 33 (18.7% and 14.4% with QV ≥ 43), while only 2.9% and 2.0% haplotypes had QV < 17 (Extended Data Fig. 1 and Supplementary Fig. 1b).

Locityper achieves near-optimal LOO accuracy

By design, Locityper always associates an input WGS sample with two existing locus haplotypes. Therefore, Locityper LOO accuracy is limited to haplotype availability, that is, similarity between the actual haplotypes and the closest haplotype remaining in the LOO panel. Overall, 66.8% haplotypes had close counterparts in the LOO panel (QV ≥ 33; 20.0% for QV ≥ 43) (Fig. 2d and Extended Data Fig. 2). Inversely, 1.2% and an additional 6.5% haplotypes were dissimilar from any unrelated haplotype (QV < 17 and 17–23).

An optimal solver, which always finds the closest genotype from the LOO panel, would achieve a median QV = 36.93, just 1.66 points higher than Illumina-based Locityper and 0.03 higher than HiFi-based. For Illumina datasets, Locityper underperforms on average by just 2.03 QV points compared to the theoretical best, with 95.1% (86.8%) haplotypes trailing by under ten (five) QV points (Fig. 2h). Even further, across PacBio HiFi datasets, Locityper predictions differ from optimal by 0.72 QV points on average; this number drops down to 0.54 when considering well-represented haplotypes (availability ≥33). Overall, 98.7% (96.0%) HiFi-based haplotypes were within the ten (five) point margin (Fig. 2i).

This analysis shows that Locityper performs extremely well when required haplotypes are present in the reference panel, and achieves near-optimal accuracy with only limited haplotype sets. Growing numbers of haplotypes in pangenomes15 are likely to increase Locityper accuracy even further.

Locityper outperforms variant calling pipelines

By identifying the two most similar locus haplotypes to a given WGS dataset, Locityper effectively infers the two haplotype sequences at a locus. This provides an opportunity to benchmark Locityper against any phased variant call set, which likewise can be interpreted as a prediction of both haplotype sequences. Consequently, we evaluated the New York Genome Center (NYGC) call set for the expanded 1KGP (1000 Genomes Project) cohort of 3,202 samples3, of which 39 have HPRC assemblies. Even though the NYGC pipeline uses state-of-the-art variant callers, 1KGP haplotypes had significant divergence from the actual sample haplotypes: only 27.4% haplotypes achieved QV ≥ 33 and another 22.3% haplotypes had QV < 17, while the median QV was 24.41, almost 11 points smaller than Locityper on Illumina reads (Fig. 2e and Extended Data Fig. 3).

While short-read datasets are difficult to genotype at complex loci, PacBio HiFi data are arguably the easiest. To put Locityper performance in perspective we examined phased SV calls, generated by Sniffles31 for 20 HiFi datasets. As Extended Data Fig. 4a shows, Sniffles alone did not achieve high levels of accuracy, producing a median QV = 25.09. Combining SVs with short variant calls, produced by DeepVariant32, raised the median QV to 35.19, which is 1.71 points behind Locityper on the same data and 0.08 points behind Illumina-based Locityper. While Sniffles + DeepVariant (Extended Data Fig. 4b) produced a larger fraction of poor haplotypes (4.7% and 7.9% with QV < 17 and 17–23 against 2.9% and 6.7% for Locityper), this pipeline also produced a bigger share of extremely accurate haplotypes (21.7% against 18.7%), probably because of Locityper’s inability to call new variants.

Locityper produces concordant trio predictions

Additionally, we genotyped the full 1KGP cohort of 3,202 Illumina WGS samples, including 563 trios independent from the HPRC cohort. At each of the target loci and for each trio we calculated concordance, that is, the similarity between child and parent haplotypes (Methods). As Fig. 2f shows, the vast majority of trio haplotypes were concordant: 64.8% and 81.7% with QV ≥ 43 and ≥33, respectively. Moreover, the median concordance QV surpassed 44.4 and was over 43 at 90% of the loci (Extended Data Fig. 5).

Almost perfect accuracy with a full reference panel

Finally, we examined Locityper’s ability to accurately identify true sample haplotypes using a full reference panel. This experiment should mimic future pangenomes, where almost all haplotypes present in the population would also exist in the reference panel. At each of the sequencing technologies, Locityper achieved an extremely high median QV (>47) and produced more than 93% haplotypes with QV ≥ 33. Illumina-based and ONT-based haplotypes showed slightly lower accuracy: 83.1% and 81.6% had QV ≥ 43, respectively, while only 1.0% and 0.4% had QV < 17. On the other hand, simulated short reads and PacBio HiFi datasets produced almost perfect haplotypes: 96.6% and 95.1% with QV ≥ 43 and ≈0.1% with QV < 17 (Fig. 2g, Extended Data Fig. 6 and Supplementary Fig. 2). A variant call set obtained from the Locityper haplotypes using the full reference panel and Illumina data showed a significantly higher F1 score than the 1KGP call set, as well as higher precision and recall compared to the pangenome-based variant caller Pangenie18 (Supplementary Fig. 3 and Supplementary Information).

Locityper accurately genotypes HLA and KIR genes

To evaluate Locityper’s ability to genotype hyperpolymorphic genes, we examined genes from two medically relevant genomic regions: the major histocompatibility complex (MHC), covering over 4 Mb and over 200 genes33, and the KIR gene cluster spanning 150 kb and 17 genes34. The two regions contain extremely polymorphic HLA and KIR genes, which have an essential role in adaptive and innate immune systems35,36. As Locityper genotypes target loci based solely on the sequences of available haplotypes, it is not limited to gene bodies and can use the intergenic sequence, gene order and presence and absence of copy-number-variable genes. As such, Locityper can predict missing genes by selecting padded haplotypes that lack the gene of interest.

Multiple specialized tools have been developed for genotyping the MHC locus19,22,37, the newest being T1K23, a state-of-the-art38 genotyper for HLA and KIR genes that is capable of processing whole-genome and whole-exome short-read sequencing data. To compare T1K and Locityper accuracy, we genotyped 40 Illumina HPRC WGS datasets at 26 genes and 14 pseudogenes from the MHC locus and 14 genes and three pseudogenes from the KIR locus, all combined into 33 target loci with a sum length of 1.15 Mb.

In the LOO configuration, at the MHC locus, Locityper achieved a full match with assembly-based allele annotation (correctly predicted all fields in the HLA nomenclature39,40) in 88.8% cases, compared to T1K’s 64.1% (Fig. 4a). At the same time, the two methods correctly predicted the protein product (second nomenclature field) in 95.1% and 78.2% of cases, respectively. Meanwhile, at the KIR gene cluster, Locityper and T1K correctly predicted protein products in 84.9% and 67.1% cases and achieved full match in 80.8% and 57.9% cases, respectively (Fig. 4b). When using the full reference panel, which also containing the input samples, Locityper achieved almost perfect accuracy: full match in 99.4% and 99.9% of cases at the MHC and KIR loci, respectively.

Fig. 4: Haplotyping accuracy for 40 HPRC samples at the MHC and KIR loci.
figure 4

a,b, Subpanels showing the fraction of haplotypes, predicted with varying accuracy at 40 pseudogenes from the MHC locus (a) and 17 pseudogenes from the KIR gene cluster (b). Fully predicted alleles and correctly identified missing copies, are colored dark blue (full match) because of the different number of allele fields in the HLA/KIR gene nomenclature39,58. Otherwise, haplotypes are colored according to the number of correctly predicted fields. Accuracy is shown for Locityper with the full reference panel (F), Locityper in the LOO setting with and without weights (denoted L and W, respectively) and T1K. The last entry in each panel shows the average accuracy across all corresponding genes and pseudogenes.

Unlike T1K, Locityper does not distinguish between exons, introns and intergenic space. This may result in lower accuracy when a haplotype carrying a false gene allele better explains input reads within a noncoding sequence. To handle such cases, users may use a weighted Locityper mode, giving lower weight to read depth and read alignments occurring outside exons. Using a weight of 0.1 for introns and 0.005 for intergenic regions, Locityper’s accuracy rose to 96.5% (protein product) and 90.8% (full match) at the MHC locus, and to 89.9% and 86.2% at the KIR cluster (Fig. 4).

Some protein products were present in only one HPRC sample; consequently, such samples cannot be correctly annotated by Locityper in the LOO setting. Such cases explained 64.6% and 36.2% of all errors made by Locityper in the weighted mode at the MHC and KIR loci, respectively (Extended Data Fig. 7). This is especially noticeable at the hyperpolymorphic HLA-A, HLA-B and HLA-DRB1 genes, where protein groups were missing from the LOO panel in 10–22% of cases, which explains the vast majority of Locityper errors. At the same time, T1K often predicted a smaller copy number than required, explaining 79.1% and 24.6% of all errors at the MHC and KIR clusters, respectively. When ignoring these two error types (missing copy and unavailable protein groups), Locityper notably outperformed T1K in predicting protein products: 99.0% against 94.5% at the MHC locus, and 94.6% against 73.0% at the KIR gene cluster. Overall, the general-purpose tool Locityper performed in a competitive manner even when compared to T1K, which was specifically designed for HLA and KIR genes. However, accurate genotyping of the most diverse genes would still probably benefit from larger pangenome sizes.

Accurate genotyping of disease-relevant gene families

Although the set of CMR genes included a wide variety of genetically diverse genes, several important polymorphic gene families were underrepresented in it. The mucin genes are a highly heterogeneous gene family (MUC1MUC24)41. Mucin genes encode large glycoproteins that are essential to barrier maintenance and the defense of epithelial tissues. All canonical mucins harbor a large exon that contains variable number tandem repeats (VNTRs), whose sequences vary per mucin, yet each extensively encode serine and threonine residues for glycosylation42. The gene family can be broken up into two subgroups: tethered and secreted mucins. In tethered mucins, single VNTR domains contain variation in total motif copy number and motif usage (Fig. 5a). Secreted mucins harbor potential variation in VNTR domain copy number, VNTR motif copy number, VNTR motif usage and cysteine domain copy number43,44 (Fig. 5b). The presence of these repetitive sequences makes mucins both highly polymorphic and difficult to accurately sequence and genotype using short reads.

Fig. 5: Locityper can accurately genotype mucin and other gene families.
figure 5

a, Gene model of MUC1, a mucin tethered to the surface of epithelial cells. MUC1 harbors a 20-amino-acid VNTR repeat sequence and is highly polymorphic in VNTR length59, as represented by the example haplotypes 1–3. b, Gene model of MUC5B, a secreted, gel-forming mucin that is important for homeostasis in the lungs. MUC5B encodes an irregular 29-amino-acid VNTR motif that is broken up into separate VNTR domains by cysteine domains. The number of VNTR domains, cysteine domains and VNTR motifs could each contribute to polymorphism among haplotypes at this locus60. c, Difference in average haplotyping accuracy (QV) between Locityper and the 1KGP call set at 15 mucin genes based on 39 Illumina WGS datasets. Improvement for the LOO setting and the full Locityper database are shown using dark and light shades, respectively. Tethered and secreted mucins are shown in purple and green; the only non-gel-forming secreted mucin MUC7 is marked with an asterisk. d, Locityper (LOO) and 1KGP call set average genotyping accuracy (QV) across four gene families: CFH (orange); CYP2 (light green); FCGR (red); and MUC (blue). The diagonal black line shows the zero improvement boundary and the diagonal gray lines show a QV improvement of 10, 20 and 30. PTS, proline (P), threonine (T), serine (S).

Locityper leverages information about both read depth and read alignment for genotyping; therefore, the tool is well suited to characterizing mucin genetic variation. Based on 39 HPRC Illumina WGS datasets, Locityper (LOO) haplotypes achieved on average a 10.5 higher QV compared to the 1KGP call set across 15 examined MUC loci, with the largest improvement observed at MUC6 and MUC16 with 29.7 and 18.5 higher QV, respectively (Fig. 5c). The only negative QV difference between Locityper and 1KGP was observed at the non-gel-forming MUC7 gene, where the two haplotype sets showed very high QV values (43.5 and 44.2, respectively).

Further examples of genes that are challenging to address with standard calling techniques are FCGR2B and FCGR3A, encoding receptors for the Fc region of the IgG complexes45,46. IgG binding to FCGR2B induces the immune complexes of phagocytosis and endocytosis and thus establishes the basis of antibody production by B cells. The second receptor, FCGR3A, is expressed on natural killer cells as an integral membrane glycoprotein46 and has a central role in limiting viral load and viral propagation in a memory-like manner47. Genetic variations in both genes have been associated with systemic lupus erythematosus48 and other immune disorders49. However, genetic analyses of the FCGR genes using high-resolution short reads have been notoriously difficult because of recent gene duplication and diversification processes50. Nevertheless, at the FCGR2B and FCGR3A receptor genes, Locityper (LOO) improves the average QV by 4.95 and 9.3 points, respectively, compared to the 1KGP call set (23.0 to 27.9 and 20.3 to 29.6) (Fig. 5d). A larger reference panel would probably improve Locityper’s ability to genotype FCGR genes even further because the tool achieves much higher accuracy (35.6 and 54.0) when using its full reference panel.

Moreover, Locityper (LOO) achieves significant QV improvement (12.3) at the CFH gene, which is associated with age-related vision loss and kidney disorders51,52. Finally, Locityper showed on average a 4.6 higher QV across 16 protein-coding CYP2 genes that have a major role in drug metabolism53,54. Out of the CYP2 genes, Locityper achieved the highest improvement at CYP2U1 (10.2), CYP2A13 (10.8) and CYP2W1 (11.6) (Fig. 5d).

Runtime and memory usage

Locityper WGS preprocessing (executed once per dataset) took on average 16 min using eight threads and consumed 15 Gb of RAM for 30× Illumina WGS datasets. If a dataset with a similar library preparation was previously processed, read mapping can be skipped, which speeds up WGS preprocessing to under 3 min. The next step, read recruitment, can simultaneously identify reads for multiple target loci. Because reading and decompressing input data was the most time-consuming operation, recruitment speed did not depend on the number of loci (1–256 tested) and lasted under 15 min on average.

Next, mapping reads to the reference panels across 256 target loci took under 19 min using eight threads; locus genotyping consumed another 45 min. Together, these two steps required approximately 15 s per target locus and 7 Gb of RAM. Locityper uses stringent haplotype filtering as the first genotyping step, allowing it to avoid quadratic runtime. Thus, full analysis based on five-times-larger reference panels (obtained by artificially mutating existing haplotypes) required only three times as much time (Extended Data Fig. 8).

Altogether, Locityper analysis of the MHC and KIR loci, including preprocessing, required 35 min using eight threads. However, genotyping in the weighted mode was more computationally intensive, raising the total runtime to 1 h and 5 min. At the same time, T1K with eight threads required on average 2 h and 30 min and 48 min to process the MHC and KIR loci, respectively, and required 2.5 Gb of RAM. Pangenie calls variants across the whole genome; consequently, it had a heavier runtime and memory footprint: at 24 threads, its pangenome indexing (executed once) and genotyping steps took 34 min and 1 h and 40 min, respectively, and consumed 60 and 37 Gb of RAM.

In addition to unmapped data, Locityper and T1K can efficiently use mapped reads (in BAM/CRAM format for Locityper and BAM format for T1K) by only recruiting reads aligned to the regions of interest or to alternative contigs, as well as unmapped reads. Additionally, by examining existing alignments, Locityper can preprocess WGS datasets almost immediately. Overall, this decreases T1K runtime to 45 min and 23 min for the HLA and KIR loci, respectively, and speeds up the full Locityper pipeline for these genes to 10 min.

Discussion

In this study, we present Locityper, a targeted method for genotyping complex polymorphic genes using both short-read and long-read WGS. Locityper implements fast read recruitment to a collection of target loci, and uses a carefully balanced probabilistic model to calculate genotype likelihoods based on read alignment, insert size and read depth profiles. Locityper uses ILP or stochastic optimization to find the most likely genotype for each target locus. Locityper departs from the prevalent variant-centric approach, which we argue constitutes a particular limitation for highly polymorphic loci. In contrast, our approach leverages collections of known haplotype sequences, which can be extracted from a pangenome reference or directly provided by the user. By examining larger regions around genes of interest, Locityper inherently makes use of any available information, including the intergenic sequence, gene order, SVs and copy number of short tandem repeats. Locityper is easy to install via Docker, Singularity or Conda, only requires easy-to-obtain input files, and has a small memory footprint and significantly shorter runtime than both T1K and Pangenie.

We demonstrated Locityper’s accuracy through excellent agreement to both phased genome assemblies and Mendelian consistency across the 563 family trios included in the 1KGP cohort. When evaluated across a wide range of challenging disease-associated genes, Locityper produces significantly more accurate haplotype predictions compared to state-of-the-art phased variant calling pipelines on Illumina and PacBio HiFi data. Locityper’s accuracy remains consistently high across several input sequencing technologies, performing well for Illumina, simulated short reads, PacBio HiFi and ONT datasets.

At present, the size of the available collections of reference haplotypes still poses a limitation: overall, 33% haplotypes did not have a good representative (QV < 33) in the LOO reference panels (Fig. 2d). Therefore, despite Locityper’s ability to predict haplotypes close to the best available, the resulting accuracy is not yet ideal for all genes of interest. Significantly larger pangenomes are presently being constructed by the HPRC15 and we are confident that these future pangenomes will lead to a significant increase in performance on out-of-sample individuals for more complex polymorphic genes. Even now, Locityper outperforms the specialized genotyper T1K across HLA and KIR genes in a LOO setting and shows improved ability to genotype other medically relevant gene families (for example, MUC and FCGR) using short-read WGS.

As part of this study, we used Locityper to process 3,202 Illumina WGS datasets from the 1KGP and make the obtained genotypes available, which provides a resource for deeper analyses of 256 challenging target loci. Additionally, publicly available Locityper-preprocessed WGS summaries will allow for faster genotyping of genes that were not a focus of this study across the 1KGP cohort. We envision that Locityper will enable the inclusion of complex loci in GWAS55 and PheWAS56 analyses, especially in a larger cohort, such as the All-of-Us program25 and the UKB26, which promises to discover many new associations and explain missing heritability. Of note, Locityper’s ability to process both short and long reads might prove especially useful for the increasing production of long reads in the context of biobank-scale sequencing efforts.

For a given locus, Locityper aims to find two existing haplotypes that would explain an input WGS dataset in the best way. Consequently, it is not designed to reconstruct a new haplotype, even if it constitutes a mixture of already known haplotypes. To address this, Locityper outputs read alignments to the top predicted genotypes, which can be used later for visual analysis or variant calling. Combined with assembly polishing57, this could improve genotyping accuracy and allow for the reconstruction of previously unobserved alleles, a strategy that we plan to explore in future research.

Currently, two loci with significant homology, for example, part of a non-tandem segmental duplication, can only be processed independently, with potentially overlapping sets of recruited reads. Locityper mitigates this problem by tracking the number of off-target k-mers per read and haplotype window. Nevertheless, further improvements are conceivable, such as using a shared pool of reads for related loci, like the strategy implemented by T1K23.

In conclusion, Locityper allows for fast and accurate targeted genotyping of challenging polymorphic loci using several sequencing technologies. With the current draft pangenome containing highly accurate phased genome assemblies, Locityper routinely achieves sequence accuracies above a QV of 33, which is comparable to genome assemblies from Oxford Nanopore data29. As more human haplotypes are represented in pangenomes, we expect the accuracy to improve further, which will facilitate detailed analysis of previously intractable genes, leading to improved diagnostic power and new disease associations.

Methods

In this article, we present a targeted tool, Locityper, designed for genotyping complex multiallelic loci. Locityper processes WGS data produced by several different sequencing technologies, including highly accurate short and long reads (such as Illumina and PacBio HiFi data, respectively), as well as error-prone long reads, such as PacBio CLR and Oxford Nanopore data. Locityper can efficiently analyze unmapped reads stored in various formats, as well as mapped reads from sorted and indexed BAM and CRAM files.

Broadly, the method can be split into several steps: (1) preprocessing of target loci; (2) sample preprocessing, performed once for each WGS dataset; (3) read recruitment, carried out simultaneously for multiple loci; and (4) locus genotyping and generation of BAM files with alignment to the best genotypes.These steps are described in more detail in the following sections.

Preprocessing of target loci

Locityper uses solely locus haplotype sequences and does not require any kind of additional graph structure. Locus haplotypes can be provided directly in a FASTA file. Alternatively, Locityper can automatically extract locus haplotypes from a pangenome, provided in variant calling format (VCF) (constructed, for example, using Minigraph-Cactus61).

When locus haplotypes are extracted from a VCF file, Locityper tries to extend the locus in such a way that both locus ends do not overlap any pangenomic variation. Additionally, the tool tries to select a position that would produce the largest number of unique canonical k-mers at the edges of the locus (default edge size = 500 bp). In the default configuration, locus extension is limited to 50 kb at each side, but can fail if there is a longer SV at the locus boundary. In such cases, the user can either increase the allowed extension size or set the boundaries manually.

Finally, Locityper finds off-target k-mer multiplicities, calculated as the difference between canonical k-mer counts across the full reference genome (calculated using Jellyfish62 with recommended k = 25) and the corresponding k-mer counts at the reference locus sequence.

WGS dataset preprocessing

Locityper aims to probabilistically describe three features of a given WGS dataset, that is, insert size, error profile and read depth, by examining read alignments to a predefined background region. For human WGS data, we used a 4.5 Mb interval on chromosome 17q25.1 as the default background region because it contains almost no segmental duplications or other types of structural variations. Locityper first recruits input reads to the background region (see ‘Read recruitment’), optionally subsamples them and then maps them to the reference genome using Strobealign63 (short reads) or Minimap2 (ref. 64) (long reads).

Insert size

Manual examination of several paired-end WGS datasets from the HPRC project15 indicated that the negative binomial (NB) distribution fits insert size distribution the best (Extended Data Fig. 9). For a given WGS dataset, we used all fully mapped read pairs (clipping less than 2% of the read length, by default) with high mapping quality (≥20). We removed outliers by defining the maximum allowed insert size as three times the 99th percentile of the observed insert sizes, and discarded violating read pairs. Finally, we obtained the NB distribution parameters using the method of moments. During the next two preprocessing steps, we only used read pairs with insert sizes within the 99.9% confidence interval of the corresponding NB distribution.

Error profile

We used two distributions to describe the WGS error profiles. First, we used the beta binomial (BB) distribution to evaluate the edit distance based on read length. The distribution was fitted using the maximum likelihood estimation based on the remaining read pairs. The obtained BB distribution was used to distinguish between true and off-target alignments at the genotyping stage.

Second, we calculated match, mismatch, insertion and deletion rates (PM, PX, PI, PD, respectively) and defined the alignment likelihood as the product of the corresponding rates to the power of the number of operations. For example, alignment with 100 matches, one mismatch and two insertions would have a likelihood of \({P}_{M}^{\,100}\times {P}_{X}^{\,1}\times {P}_{I}^{\,2}\times {P}_{D}^{\,0}\). Note that the probabilities do not sum up to one and are incomparable between reads of different lengths. Nevertheless, this formulation produces fast-to-calculate probabilities and provides a way to numerically compare different alignments of the same read.

Read depth

We split the background region into windows of fixed size based on the mean read length and assigned reads to windows based on the middle of the corresponding read alignments. Next, we counted the number of primary read alignments assigned to each window. (Only first mates were counted to preserve window independence.)65

For each window, we calculated the guanine-cytosine (GC) content and the fraction of unique k-mers in an area centered around the window. Next, we selected windows with many unique k-mers (≥90%) and estimated the mean read depth and variance across various GC content values using local polynomial regression66. NB parameters were then estimated separately for each GC content based on the smoothed mean, variance and subsampling rate (Supplementary Information).

Read recruitment

After dataset preprocessing, Locityper recruits reads to all target loci. For that, we collected minimizers67 from each locus and each haplotype (default: (10,15)-minimizers). Uninformative minimizers, which appear five or more times off-target, were ignored. Locityper compares read and target minimizers in parallel and recruits reads to one or several loci according to one of the following rules: short reads are recruited if a sufficient fraction of minimizers matches the target for all read ends (default: 0.7 and 0.5 for single-end and paired-end reads). Lower match fraction values lead to an increased number of unnecessarily recruited reads, which increases read mapping runtime but does not significantly affect genotyping accuracy because false positive reads are discarded at a later stage.

Only a small part of a long read may overlap a given target locus. Consequently, we recruited a long read if it contained a subregion with sufficiently many minimizer matches. For that, we used the following heuristic: matching and mismatching informative minimizers are assigned s+/s scores (default: +3/−1); a read was recruited if it had a continuous subsequence, with a sum score greater or equal to

$$\left\lceil\,2L \frac{M({s}_{+}-{s}_{-})+{s}_{-}}{{m}_{w}+1}\,\right\rceil$$
(1)

where L is the subregion length (default: 2,000 bp), M is the match fraction (default: 0.5) and 2L/(mw + 1) is the expected number of (mw, mk) minimizers per L base-pair sequence68. This heuristic is useful because it can be quickly evaluated using Kadane’s algorithm69 and is not too restrictive: shorter read subregions with a higher match rate may produce a hit, and vice versa.

Genotype likelihood

Read location probabilities

After read recruitment, every target locus was genotyped independently from other loci. Reads, recruited to the locus, were aligned to all haplotypes H using either Strobealign63 or Minimap2 (ref. 64), depending on the read type. The obtained read alignments were assigned BB P values according to their edit distances and read lengths. A read pair was retained if both read ends had at least one good alignment (P ≥ 0.01) to at least one of the haplotypes (approximately 3% read pairs discarded per locus). All alignments with BB P < 0.001 were discarded.

Without loss of generality, we describe the following steps for paired-end reads and use notation r = (r1, r2) to describe a read pair. Each locus haplotype h H was split into nonoverlapping windows W(h) of fixed size (same size as in read depth preprocessing); furthermore, we expanded W(h) by adding a null window w. Each alignment is connected to a single window w based on the middle point of the alignment, with alignment probability \({{P}}({r}_{\!j},w)\) calculated according to the precomputed error profile. Reads without proper alignment to h are connected to the null window w; we defined P(rj, w) as \(\Lambda \cdot \mathop{\max }\limits_{h}{{P}}({r}_{\!j},h)\), that is, the probability of the best rj alignment to any haplotype, multiplied by a penalty Λ (10−5 by default).

The paired-end alignment probability of the read pair r = (r1, r2) to windows w = (w1, w2) can be written as \({{P}}({\mathbf{r}},{\mathbf{w}})={{P}}({r}_{1},{w}_{1})\times\)\({{P}}({r}_{2},{w}_{2})\times {P}_{{\rm{insert}}}({\mathbf{r}},{\mathbf{w}})\), where the last term is calculated according to the precomputed insert size distribution. For the null windows, we defined insert size probability as the highest probability achievable under the precomputed insert size distribution. Thus, the insert size between a read end and its unmapped counterpart is assumed to be optimal to only penalize unpaired locations once. Finally, we denoted the full set of possible read pair locations on haplotype h as L(h) W(h) × W(h) and defined the probability of the read pair r location to be w as the normalized alignment probability:

$${\mathcal{P}}_{{\mathbf{rw}}}=\frac{{\mathcal{P}}({\mathbf{r}},{\mathbf{w}})}{{\sum }_{{h}^{{\prime} }\in H}{\sum}_{{\mathbf{u}}\in {L}^{({h}^{{\prime} })}}{\mathcal{P}}({\mathbf{r}},{\mathbf{u}})}$$
(2)

Some parts of the target loci can have high homology to other genomic regions. Consequently, we downgraded the effect of potentially misrecruited reads by setting equal probabilities to all locations for read pairs with fewer than five target-specific k-mers.

Read assignment

Without loss of generality, let us consider a diploid genotype g = (h1, h2). We combined windows across the two haplotypes \({W}^{({\mathbf{g}})}={W}^{({h}_{1})}\cup {W}^{({h}_{2})}\). If h1 = h2, we used two copies of each window, such that W(g) is always \(| {W}^{({h}_{1})}| +| {W}^{({h}_{2})}|\). Similarly, we concatenated possible locations \({L}^{({h}_{1})}\) and \({L}^{({h}_{2})}\) to achieve a combined list of locations L(g).

We described read assignment to the genotype g using a Boolean matrix T, where Trw = 1 encodes the statement ‘true location of the read pair r is w’ and every row contains exactly one true element. Probability of the read assignment T given read pairs R can be described as the total probability of all selected locations:

$$P(T\,| \,R)=\prod _{{{r}}\,\in \,R}\sum _{{\mathbf{w}}\in {L}^{({\mathbf{g}})}}{T}_{{\mathbf{rw}}}\cdot {{\mathcal{P}}}_{{\mathbf{rw}}}$$
(3)

Read depth likelihood

In addition to good alignment probabilities, optimal haplotypes should have stable haploid read depth. The corresponding conditional probability can be written as:

$$P\left({\rm{CN}}({\mathbf{g}})=1\,| \,T\;\right)\,=\prod _{w\in {W}^{\;({\mathbf{g}})}}P\left({\rm{CN}}(w)=1\,| \,{d}_{w}(T\;)\right)$$
(4)

In this equation, dw(T) denotes the window w depth according to the read assignment T, defined as \({\sum }_{{\mathbf{r}}}{\sum }_{u}\left[{T}_{{\mathbf{r}},wu}+{T}_{{\mathbf{r}},uw}\right]\). At CN = 1, read depth follows the NB distribution with the precomputed parameters n and ψ. Bayes’ theorem with equal priors produces the following result:

$${\varphi }_{w}(T\;)\,=\,P\left({\rm{CN}}(w)=1\,| \,{d}_{w}(T\;)=d\right)\,=\,\frac{{\rm{NB}}\left(d;\,n,\psi \right)}{{\sum }_{c\in \{1\}\cup {C}_{{\rm{alt}}}}{\rm{NB}}\left(d;\,cn,\psi \right)}$$
(5)

where alternative hypotheses are represented by a set Calt. We found it beneficial to use Calt = {0.5, 1.5}; in other words, a half divergence from the expected read depth was considered significant. As unmapped reads are already penalized by low alignment probabilities P(r, w), we defined P(CN(w) = 1 | d) for any read depth d.

Window and read weights

Low-complexity regions, and short and long repeats, evoke difficulties in read sequencing, recruitment and alignment. To assign window weights in a continuous fashion, we defined the following two parametric function ϑ: [0, 1] [0, 1] as:

$$\vartheta (x;\eta ,q)\,=\,\left\{\begin{array}{ll}0\quad &{\rm{if}}\,x=0,\\ \frac{1}{{\left(\frac{\eta }{x}\times \frac{1-x}{1-\eta }\right)}^{q}+1}\quad &{\rm{otherwise}}\end{array}\right.$$
(6)

ϑ exhibits several useful properties: it is a strictly increasing smooth function such that ϑ(0) = 0 and ϑ(1) = 1. The location parameter η (0, 1) defines the break point ϑ(η; η, q) = 1/2 q, while the power parameter q controls the slope of the function, with larger q producing larger derivative \({\vartheta }^{{\prime} }(\eta ;\eta ,q)\) (Extended Data Fig. 10). Finally, we defined window w weight ζw = ϑ(x1; η1, q1) × ϑ(x2; η2, q2) based on the fraction of the locus-specific k-mers x1 and linguistic sequence complexity x2 = U1U2U3, where Ui is the fraction of unique i-mers in window w of the maximal possible number of distinct i-mers70, with the default parameters η1 = 0.2, η2 = 0.5 and q1 = q2 = 4.

Locityper accepts explicit user-defined weights for each base pair of the input haplotypes, useful, for example, for downweighting noncoding sequence. In such cases, ζw is multiplied by the average weight across window w, while each read receives its own weight based on the maximum explicit weight under the primary alignments of both read ends. After that, read weights are used as multipliers to log-location-probabilities.

Combined likelihood and likelihood update

Not accounting for window weights, combined likelihood for a genotype g and read assignment T can be calculated as:

$$\begin{array}{rcl}P\left({\rm{CN}}({\mathbf{g}})=1,T\,| \,R\right)\,&=&P\left(T\,| \,R\right)\,\times \,P\left({\rm{CN}}({\mathbf{g}})=1\,| \,T\;\right)\\ &=&\prod\limits_{{\mathbf{r}}\,\in \,R}\sum\limits_{{\mathbf{v}}\in {L}^{({\mathbf{g}})}}{T}_{{\mathbf{rv}}}\cdot {{\mathcal{P}}}_{{\mathbf{rv}}}\,\times \,\prod\limits_{w\in {W}^{({\mathbf{g}})}}{\varphi }_{w}(T\;)\end{array}$$
(7)

Next, we moved the calculations to log-space, added window weight ζw and introduced the contribution factors ΩR, ΩD ≥ 0, which represent the relative importance of read alignment and read depth likelihoods, respectively. Then, the log-likelihood \({\mathcal{L}}\) can be written as:

$${{\mathcal{L}}}_{T}^{({\mathbf{g}})}\,=\,{\Omega }_{R}\sum _{{\mathbf{r}}\,\in \,R}\sum _{{\mathbf{w}}\in {L}^{({\mathbf{g}})}}{T}_{{\mathbf{rw}}}\log {{\mathcal{P}}}_{{\mathbf{rw}}}\,+\,{\Omega }_{D}\sum _{w\in {W}^{({\bf{g}})}}{\zeta }_{w}\log {\varphi }_{w}(T\;)$$
(8)

The contribution factors ΩR and ΩD are necessary because read alignments can overshadow read depth due to the large number of read pairs and large differences between several read alignments. The factors should sum up to two to generate the same range of likelihoods as in the unweighted case (ΩR = ΩD = 1). We used the default values ΩR = 0.15 and ΩD = 1.85 because they produced good results across a selection of target loci and sequencing datasets. When needed, users can provide custom Ω values to adjust read alignment and depth balance for specific loci of interest to achieve optimal accuracy.

Likelihood update

Given the \({{\mathcal{L}}}_{T}^{({\bf{g}})}\) log-likelihood for genotype g and some read assignment T, we can efficiently calculate the \({{\mathcal{L}}}_{{T}^{{\prime} }}^{({\bf{g}})}\) log-likelihood for a new read assignment \({T}^{{\prime} }\) if the read assignment has changed for only one read pair. Suppose that the read assignment changed for read pair r from location uv (in T) to \({u}^{{\prime} }{v}^{{\prime} }\) (in \({T}^{{\prime} }\)). Then, the read depth likelihood values \({\varphi }_{w}({T}^{{\prime} })\) will be identical to φw(T) for all windows except for \(u,v,{u}^{{\prime} },{v}^{{\prime} }\), where read depth can be recomputed quickly. This way, the log-likelihood can be recalculated in constant time:

$$\begin{array}{rcl}{{\mathcal{L}}}_{{T}^{{\prime} }}^{({\bf{g}})}\,&=&\,{{\mathcal{L}}}_{T}^{({\bf{g}})}+\,{\Omega }_{R}\cdot \left(\log {{\mathcal{P}}}_{{\bf{r}},{u}^{{\prime} }{v}^{{\prime} }}-\log {{\mathcal{P}}}_{{\bf{r}},uv}\right)\\ &+&{\Omega }_{D} \sum\limits_{w\in \{u,v,{u}^{{\prime} },{v}^{{\prime} }\}}{\zeta }_{w}\cdot \left(\log {\varphi }_{w}({T}^{{\prime} })-\log {\varphi }_{w}(T\;)\right)\end{array}$$
(9)

Finding the best read assignment

For each genotype g, we aimed to find such read assignment T that would maximize the joint \({{\mathcal{L}}}_{T}^{({\bf{g}})}\) log-likelihood. Locityper implements three approaches for finding such read assignment: stochastic greedy approach71; simulated annealing72; and ILP73. The first two algorithms start from an arbitrarily generated read assignment T, then iteratively select a random read pair r and switch its location if it increases the genotype likelihood. In addition to good location switches, simulated annealing permits bad switches (decreasing the overall likelihood), gradually restricting the frequency of such events.

In an ILP formulation, we introduced two sets of unknowns: xrw {0, 1} for each read pair r and each location w L(g); and ywd {0, 1} for each window w W(g) and each possible window depth d between zero and the maximal possible read depth (Dmax). The problem can be written as follows:

$$\begin{array}{rcl}{\rm{Maximize}}\quad \sum\limits_{{\bf{r}}\in R}\sum\limits_{{\bf{w}}\in {L}^{({{g}})}}{x}_{{\bf{rw}}}\cdot {\Omega }_{R}\log {{\mathcal{P}}}_{{\bf{rw}}}&+&\sum\limits_{w\in {W}^{({\bf{g}})}} \sum\limits_{d=0}^{{D}_{\max }}{y}_{wd}\cdot {\Omega }_{D}{\zeta }_{w}{\varphi }_{w}(d\;)\\ {\rm{Subject}}\,{\rm{to}}\quad \qquad \qquad \qquad \sum\limits_{{\bf{w}}\in {L}^{({\bf{g}})}}{x}_{{\bf{rw}}}&=&1\quad \forall {\bf{r}}\in R,\\ \sum\limits_{d=0}^{{D}_{\max }}{y}_{wd}&=&1\quad \forall w\in {W}^{({\bf{g}})},\\ \sum\limits_{{\bf{r}}\in R}\sum\limits_{u\in {W}^{({\bf{g}})}}\left({x}_{{\bf{r}},wu}+{x}_{{\bf{r}},uw}\right)-\sum\limits_{d=0}^{{D}_{\max }}d\cdot {y}_{wd}&=&0\quad \forall w\in {W}^{({\bf{g}})}\end{array}$$
(10)

Note that we can remove variables xr for trivial read pairs, which map to only one possible location; at the same time, the number of possible read depth variables yw is exactly one more than the number of nontrivial read pairs mapping to w. Finally, the sum \({\sum }_{{\bf{r}}\in R}{\sum }_{u\in {W}^{({\bf{g}})}}\) in the third constraint can be limited to windows and read pairs relevant to window w. Locityper uses two commercial ILP solvers, both available under academic licenses: HiGHS74 and Gurobi (www.gurobi.com). Note that it is possible to state a bigger ILP problem by removing the need to iterate over all possible genotypes (Supplementary Information). However, we observed that existing ILP solvers are unable to quickly and accurately find a solution to such a problem.

Locus genotyping

To find the best locus genotype for the input WGS data, Locityper finds the best read assignment and the corresponding genotype likelihood for each possible locus genotype (Fig. 1). To speed up the process, we started by calculating the log-likelihood in the absence of read depth (ΩD = 0), which can be efficiently computed by assigning every read to its most probable location. Then, we used heuristic filtering by removing all genotypes whose likelihood is 10100 smaller than the best likelihood (the first 500 genotypes are kept regardless of the likelihood). For all remaining genotypes, the best read assignment is found using one of the three approaches described above. Even though the ILP solvers typically find better read assignments, we used simulated annealing as the default solver because it produces decent read assignments in a fraction of the ILP solving time.

Splitting locus haplotypes into nonoverlapping windows is an intrinsically discrete process. Furthermore, windows can be shifted across different haplotypes because of the presence of indels. Consequently, identical read depth profiles may produce varying read depth likelihoods depending on the window boundaries. To reduce this effect, we performed a procedure similar to noise injection regularization75, where we randomly moved read alignment centers to either direction and reassigned reads to windows. In addition, we redefined the window GC content values and ζw weights as if the window was randomly moved (the actual window boundaries stay fixed). In a default configuration, read and window movement is limited to half-window size or 200 bp, whichever is smaller. Repeating noise injection several times (20 by default), together with the stochastic nature of likelihood maximization, produces a distribution of log-likelihoods for each genotype.

Finally, Locityper selects a primary genotype with the highest average log-likelihood and calculates its Phred quality27 based on the probability of error: the probability that the true log-likelihood of any other genotype is higher than the true log-likelihood of the primary genotype, calculated using a one-sided Welch’s t-test76. Additionally, we redefined genotype probabilities as the probability of having the highest true likelihood, calculated as the product of inverse t-test P values for all pairwise genotype comparisons.

Moreover, Locityper outputs the number of unexplained reads, which map to some but not to the two predicted haplotypes. Finally, Locityper outputs a weighted Jaccard distance between the minimizers of the primary genotype and other probable genotypes. In an unambiguous prediction, this value should be low because all likely genotypes should be similar to each other. Users can use these values for conservative post-genotyping filtering, for example, in the HiFi-based LOO evaluation; discarding 20.2% genotypes with over 50 unexplained reads raises the median QV from 36.9 to 38.2.

Locus selection

To create a set of target loci, we started with 273 CMR genes7. We expanded gene coordinates to a minimum of 10 kb, when needed, and supplied positions as input to Locityper locus preprocessing, allowing an additional coordinate expansion by at most 300 kb to each of the sides (add −e 300k). At this stage, eight genes (ATPAF2, CLIP2, GTF2I, GTF2IRD2, IGHV3-21, MRC1, NCF1 and SMN1) were discarded because at least one the gene ends was contained in a 300-kb-long pangenomic bubble. Afterwards, we removed redundant loci (completely contained in another locus), which produced a final set of 256 loci, containing 265 CMR genes. In similar fashion, we added 33 loci covering genes from the MHC and KIR gene clusters, and 31 loci covering the MUC, CFH and CYP2 genes. Even though the reference panels were constructed based on 90 haplotypes from whole-genome-phased assemblies8, on average around 80 unique haplotypes were reconstructed per locus, as some haplotypes are not unique while others are only partially assembled (Supplementary Table 1). The number of discarded haplotypes significantly correlated with genotyping accuracy: the median Locityper QV for the PacBio HiFi datasets had Spearman’s ρ = 0.67 with the number of duplicate haplotypes (P < 2.2 × 10−16) and ρ = −0.24 with the number of unassembled haplotypes (P = 7.5 × 105).

Data used in the study

Pangenome reference in VCF was downloaded from https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.1-mc-grch38/hprc-v1.1-mc-grch38.raw.vcf.gz. Illumina, PacBio HiFi and Oxford Nanopore data for the HPRC samples can be found at https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=working. NYGC variant calls for the 1KGP samples were downloaded from http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV. The 3,202 1KGP Illumina datasets are available on the European Nucleotide Archive under accession nos. PRJEB31736 and PRJEB36890.

Simulated Illumina data were constructed using ART Illumina30 v.2.5.8 with the parameters -ss HS25 -m 500 -s 20 -l 150 -f 15 for all phased haplotype assemblies from the HPRC project, which can be found on Zenodo https://doi.org/10.5281/zenodo.5826274.77

Benchmarking Locityper

To evaluate haplotyping accuracy, we computed full-length alignments between actual and predicted haplotypes using the Locityper align module. Internally, it finds the longest common subsequence of k-mers using LCSk++78 and completes the alignment between k-mer matches using the wavefront alignment algorithm79,80. Three k-mer sizes are tried (25, 51 and 101); an alignment with the highest alignment score is returned.

Afterwards, we calculated the haplotyping error, that is, the sequence divergence between two haplotypes, calculated as the ratio between edit distance Δ and alignment size S (edit distance plus the number of matches). As actual and predicted genotypes consist of two haplotypes, there are two possible actual–predicted haplotype pairings. Of the two options, we selected a pairing that produces a smaller ratio between sum edit distance and sum alignment size.

Then, we used Phred-like transformation of haplotyping error \({\rm{QV}}=-10\times {\log }_{10}(\Delta /S)\) to obtain the haplotyping QVs28,81. However, when two haplotypes are completely identical (Δ = 0), QV becomes infinite, which poses problems for average QV calculation. For that reason, we corrected the QV definition:

$${\rm{QV}}=-10\times {\log }_{10}\left(\frac{\max \{\Delta ,1/2\}}{S}\right)$$
(11)

This way, the QV difference between edit distances 0 and 1 is the same as between 1 and 2, and equals to \(10\times {\log }_{10}2\approx 3\). Constants smaller than 1/2 were generally even more beneficial for Locityper benchmarking.

We considered a trio of locus genotypes concordant if one of the child haplotypes closely matches one of the maternal haplotypes and another closely matches one of the paternal haplotypes. Like the haplotyping error calculation, we iterated over eight possible combinations; selected one with the smallest sum edit distance divided by the sum alignment size; and calculated the QV score for each of the child haplotypes.

To compare Locityper with state-of-the-art PacBio HiFi pipelines, we obtained existing8 unphased DeepVariant32 v.1.1.0 single-nucleotide polymorphism and indel calls for the PacBio HiFi HPRC datasets, which we phased using WhatsHap82 phase v.2.3. Next, we used the WhatsHap haplotag to assign reads to haplotypes and used Sniffles31 v.2.4 to generate phased SVs. Finally, we used RTG83 vcfmerge v.3.12.1 to generate the merged Sniffles + DeepVariant call set.

We used the Bcftools84 v.1.21 consensus to reconstruct haplotypes from each of the three phased variant call sets (Sniffles, Sniffles + DeepVariant and 1KGP3). In the process, we removed contradicting overlapping variant calls, and variants with symbolic alternative alleles (with exception of <DEL>) because they cannot be used for haplotype reconstruction.

Finally, we used T1K23 v.1.0.5 with the presets hla-wgs --alleleDigitUnits 15 --alleleDelimiter: and kir-wgs with all other parameters set to default. Ground-truth HLA and KIR annotation for the HPRC assemblies were obtained with Immuannot85 using the allele databases35,86 IPD-IMGT/HLA v.3.55 and IPD-KIR v.2.13. If a haplotype contains a new gene allele, Immuannot may associate it with several existing alleles. In such cases, we evaluated the predicted allele according to the best-matching existing allele.

In all evaluations, we used Locityper v.0.18.0 along with its dependencies SAMtools84 v.1.21, Jellyfish62 v.2.2.10, Strobealign63 v.0.13.0 and Minimap2 (ref. 64) v.2.26-r1175.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.