Introduction

Runs of homozygosity (ROH), continuous stretches of homozygous genotypes in the genome, are widely observed in various species, including humans. The degree of homozygosity within individual genomes is shaped by a combination of factors, including population history, which is characterized by migration patterns, demographic shifts, population bottlenecks, and cultural practices such as endogamy or consanguinity. The Japanese population is known for its unique genetic background, shaped by geographic isolation and limited gene flow, which makes it particularly relevant for the assessment of genetic homogeneity and the detection of both distant and recent inbreeding. Previous studies demonstrated that modern Japanese and ancient Jomon individuals exhibit a relatively high average total length of ROH, especially in shorter categories (≤500 Kb) [1,2,3]. However, because ROH analyses in modern Japanese individuals have so far been conducted on a small scale with low-coverage whole-genome sequencing (WGS) and array data, comprehensive insights into ROH patterns based on large-scale, high-coverage data are still lacking for the Japanese population.

Methods for estimating the level of homozygosity have advanced over the past decades. Traditionally, the inbreeding coefficient, F [4], was used to estimate the proportion of homozygous segments from pedigree information [5]. Microsatellites, also known as short tandem repeats (STRs), have also been used to search directly for consecutive homozygous genotypes [6]. Subsequently, significant advances in SNP-genotyping microarray technologies allowed researchers to interrogate genomic regions with an elevated degree of homozygosity with higher accuracy [2]. Earlier studies mostly focused on megabase-scale ROHs because of their association with inbreeding. However, ROHs can affect human traits and diseases beyond the context of inbreeding, making it important to consider shorter ROHs as well. Accordingly, shorter ROHs have been studied extensively in recent years, although it has been questioned whether SNP-array data, even at high density, can detect very short ROHs accurately [3]. With the emergence of high-throughput next-generation sequencing (NGS) technologies, the detectability of ROHs has steadily increased, allowing shorter ROHs to be identified to a greater extent.

ROH segments can be investigated with different detection tools, each based on a different approach. PLINK [7] scans chromosomes for consecutive homozygous genotypes by sliding a fixed-size detection window, and an ROH is called if the count of consecutive homozygous SNPs satisfies a predefined condition. However, its algorithm was initially designed for SNP genotyping array data, and hence some of its parameters need to be adjusted when it is applied to other data types [3, 8]. Alternatively, several model-based programs that employ hidden Markov models (HMMs), including Beagle, H3M2, and BCFtools, can identify potential sequences of homozygosity [9,10,11]. Among the available tools, PLINK has been used extensively in ROH studies [12,13,14], providing a convenient means of comparing SNP-array outputs across different study groups. However, a prior study by Narasimhan et al. [11] asserted that BCFtools can be applied to sequencing data with better accuracy.

An obvious concern is that NGS may introduce sequencing errors [15, 16], which could affect the accuracy of ROH identification [3]. The origins of such erroneous calls have been widely studied, and they can arise at any stage of the sequencing process, ranging from sample handling and genomic library preparation to intrinsic errors of sequencing platforms [17, 18]. Some sequencing errors can be observed as Mendelian errors if pedigrees are known, so error-correction steps such as removing calls that are inconsistent with Mendelian inheritance should be effective for ROH identification. The impact of sequencing errors on ROH should not be neglected, yet only a few studies have addressed this issue. The accuracy of ROH identification can be enhanced to a certain extent by leveraging available pedigree data and removing sequencing errors detected as Mendelian inconsistencies.

A striking attribute of ROH is how variably it is distributed across genomic regions, reflecting diverse patterns of inheritance, recombination, and population structure. Indeed, Ceballos et al. have shown that ROHs are not uniformly distributed across the human genome. Instead, ROHs tend to cluster within specific genomic regions, forming ROH islands, whose genomic locations can vary depending on ethnicity or genetic background [19]. Understanding the biological pathways of genes within ROH islands can provide insights into the broader functional significance of these regions.

In this study, we used two high-coverage whole-genome sequencing datasets: 3.5KJPNv2, a dataset constructed from a haplotype frequency panel of 3552 Japanese individuals [20], and BirThree, a dataset of 1120 Japanese individuals with pedigree information, derived from the Birth and Three-Generation Cohort Study [21]. These datasets allow us to identify shorter ROH segments more precisely by taking into account the effects of sequencing errors.

To illustrate the advantages provided by WGS datasets, we investigated the detectability of ROHs down to a very short minimum length, at the kilobase level, in the Japanese population. To further validate the detection of ROHs, we assessed the impact of SNP density by comparing all variant sites with trimmed SNP array-based sites in each dataset. We also evaluated the effects of sequencing errors on the detection of ROHs by leveraging pedigree information. The genomic distribution of ROH segments and their functional impacts were also explored by identifying ROH islands and the pathways enriched within these regions.

Results

We investigated the detectability of ROH under different analytical approaches and assessed the effects of SNP density and the functional implications, using biobank-scale Japanese WGS datasets [20,21,22,23] and major bioinformatic tools [7, 11]. We showed that whole-genome sequencing uncovers very short ROH segments that genotyping arrays fail to detect. Although long ROH segments may be affected by sequencing errors, integrating pedigree-based quality control into WGS data can help counteract these inaccuracies. Furthermore, our results revealed that using WGS data that incorporate pedigree information substantially enhances functional pathway enrichment within ROH islands.

The total numbers of ROH segments

First, we analyzed the 3.5KJPNv2 dataset to investigate the effects of marker density. We compared the length intervals of detected ROHs, and for both BCFtools and PLINK we observed a higher prevalence of ROHs between 100 Kb and 1.5 Mb at all variant sites than at the array-based sites (Fig. 1 and S1A–B). However, longer ROHs (>1.5 Mb) were more abundant at array-based sites than at all variant sites with both tools (Fig. 1).

Fig. 1 Distribution of total numbers of ROHs in the 3.5KJPNv2 dataset among all individuals. Bar graphs show the ROH distribution at genome-wide all variant sites and at OmniExpressExome array-based sites in the 3.5KJPNv2 dataset for each of the selected tools. “Het_1” denotes the use of the default value 1 for PLINK “--homozyg-window-het”. Colors represent ROH segment length intervals: ROHs between 100 Kb and 1.5 Mb, and ROHs above 1.5 Mb

Next, we further investigated the mean number (NROH) and the mean cumulative sums (SROH) of ROH segments per individual. In this context, ROH minimal length thresholds were set starting from 100 Kb for shorter ROHs (i.e., ROH100, NROH100 and SROH100), and from 1.5 Mb for longer ROHs (i.e., ROH1500, NROH1500 and SROH1500) to facilitate comparison with previous research [12, 13, 24].

Mean number of ROH segments (NROH) and mean cumulative sums of ROH segments (SROH)

ROH > 1.5 Mb (3.5KJPNv2 dataset)

We set a minimal ROH length of 1.5 Mb to detect longer ROHs per individual and investigated NROH and SROH accordingly. Consistent with the analysis of total numbers of ROH segments, ROH1500 were more prevalent at the array-specific sites than at all variant sites with both BCFtools and PLINK (Table 1, Fig. 2A). The default value of the PLINK parameter (--homozyg-window-het 1), which allows one heterozygous call per window, has been widely used in previous array-based studies on longer ROHs [3, 12, 13]. We applied this conventional value to facilitate direct comparisons with previous research. Our results demonstrated strong concordance between PLINK (mean NROH1500 of 10.94) and BCFtools (mean NROH1500 of 10.16) in detecting ROH1500 at array-based sites (Table 1, Fig. 2A).

Fig. 2 Distribution of the number of ROH1500 (NROH) per individual in (A) the 3.5KJPNv2 dataset and (B) the BirThree dataset. Violin plots represent the distribution of the number of ROH segments longer than 1.5 Mb across individuals in the 3.5KJPNv2 and BirThree datasets. Colors represent specific conditions: genomic regions and parameter adjustments in the selected tools. The PLINK “--homozyg-window-het” option was set to values from 1 to 4, i.e., allowing from one to four heterozygous calls per window. These settings are abbreviated as “Het_1”, “Het_2”, “Het_3”, and “Het_4”, respectively

Table 1 Mean total number of ROH (NROH) and mean total sum of ROH lengths (SROH) per individual in the 3.5KJPNv2 dataset, with minimal ROH length thresholds of 100 Kb and 1.5 Mb

However, longer ROHs tend to be less common in densely genotyped regions than at sparse array sites. This is likely due to sequencing errors, which increase the probability of interruptions in runs of consecutive homozygous genotypes. Ceballos et al. [3] suggested that the aforementioned PLINK parameter can be raised to 3 or 4 to allow more heterozygous calls in WGS data and achieve comparability with array-based data. We therefore adjusted the number of heterozygous calls that PLINK allows per window to account for the impact of sequencing errors at all variant sites, ensuring a reliable comparison with array-based sites. To match the results observed at SNP array-based sites (mean NROH1500 of 10.94), four heterozygous calls per window had to be allowed at all variant sites (mean NROH1500 of 10.06), in agreement with the findings of Ceballos et al. [3].

Indeed, when comparing BCFtools with PLINK under adjusted parameters, we observed a mean NROH1500 of 8.53 and SROH1500 of 22.7 Mbp for “Het_3” in PLINK, comparable to the BCFtools results (mean NROH1500 of 8.37 and SROH1500 of 23.5 Mbp). This comparison allows us to estimate the influence of sequencing errors on BCFtools detection (Table 1, Fig. 2A).

ROH > 1.5 Mb (BirThree dataset)

We next analyzed the dataset from the Tohoku Medical Megabank (TMM) Project BirThree Cohort (Table 2, Fig. 2B). In contrast to the 3.5KJPNv2 dataset, BCFtools detected more ROH1500 segments at all variant sites (mean NROH1500 of 10.44 and SROH1500 of 27.5 Mbp) than at the array-specific sites of the BirThree dataset. In addition, PLINK required only two heterozygous calls per window at all variant sites to reach comparable values (mean NROH1500 of 10.88 and SROH1500 of 29.1 Mbp), indicating greater robustness against sequencing errors in this dataset (Table 2, Fig. 2B).

Table 2 Mean total number of ROH (NROH) and mean total sum of ROH lengths (SROH) per individual in the BirThree dataset, with minimal ROH length thresholds of 100 Kb and 1.5 Mb

A comprehensive comparison of ROH length distributions between all variant sites and array-specific sites showed smaller differences in the BirThree dataset (Fig. S4E–H).

ROH > 100 Kb

To further investigate these results, we ran BCFtools to detect ROH100 segments at all variant sites in the BirThree dataset. As listed in Tables 1 and 2, significantly fewer ROHs were detected in the BirThree dataset (mean NROH100 of 1713 and SROH100 of 505.9 Mbp) than in the 3.5KJPNv2 dataset (mean NROH100 of 2190 and SROH100 of 582.3 Mbp). Intriguingly, depending on the minimal ROH length, we observed opposite tendencies between the two datasets (Fig. S3A). With PLINK, by contrast, we did not observe significant differences between the two datasets (Tables 1, 2).

To exclude the effects of SNP density, we randomly selected the same number of SNPs from the 3.5KJPNv2 dataset as in the BirThree dataset. We then ran BCFtools again on the pruned 3.5KJPNv2 dataset, applying the same minimal ROH length thresholds as described above. More long ROHs were observed in BirThree (mean NROH1500 of 10.44) than in the pruned 3.5KJPNv2 dataset (mean NROH1500 of 8.82), while no significant change was noted between the pruned 3.5KJPNv2 dataset and its unfiltered version (Table 3 and Fig. S3A, left). Unlike ROH1500, ROH100 showed a noticeable decline in both comparisons when the minimal length was set to 100 Kb, indicating a possible influence of SNP density (Table 3 and Fig. S3A, right). However, we did not observe significant differences in either ROH100 or ROH1500 when the analysis was restricted to the overlapping samples (Fig. S3B–E).

Table 3 Assessing the effects of SNP density after SNP-pruning on 3.5KJPNv2 dataset

ROH islands and functional analysis

To examine the genomic distribution of ROH segments and their functional impacts, we defined ROH islands by applying 99.9th or 99.5th percentile thresholds to the regions with shared ROH segments, depending on the minimum ROH length (Fig. S1C–G). Our analysis revealed enriched pathways related to three main gene families that are prominently represented in the ROH island regions detected by BCFtools. First, the USP17 family of genes showed a significant presence in pathways related to protein deubiquitination, proteolysis, and cell apoptosis. Second, several members of the TAS2R family (TAS2R14, TAS2R20, TAS2R30, TAS2R31, TAS2R43, TAS2R46, TAS2R50) were identified within ROH islands on chromosome 12 in the BirThree dataset, and these were associated with taste receptor activity, taste transduction, and sensory perception of taste pathways (Fig. 3 and S2A, Table S2A). Third, significant enrichment of olfactory receptor activity, signaling, and transduction pathways was driven by multiple OR4 genes (OR4A47, OR4B1, OR4C3, OR4C5, OR4S1, OR4X1, OR4X2) detected within ROH islands located in longer ROH regions (>1.5 Mb) (Fig. S2B and Table S2D–E). Additionally, PLINK identified ROH islands in regions containing genes such as HADHA, HADHB, IP6K1, and IP6K2, which are involved in enzymatic activity linked to fatty acid metabolism and inositol phosphate synthesis (Fig. 3 and S2A–B) (Tables S2B–C and F).

Fig. 3 Functional enrichment analysis of annotated genes within runs of homozygosity (ROH) islands. This figure presents the results of functional enrichment analysis of genes identified within ROH islands located in shorter ROH regions (>100 Kb), detected in the BirThree dataset via BCFtools (by setting a 99.9th percentile threshold based on the frequencies of overlapping ROH100 regions shared among individuals). The g:Profiler tool was used to identify enriched biological pathways (BP), molecular functions (MF), and cellular components (CC) from Gene Ontology (GO), KEGG, and Reactome. The y-axis displays the enrichment score, indicating statistical significance, while the x-axis and color coding represent the data source. Dot size corresponds to the number of genes associated with each term. A summary of comparative statistics for the other dataset (3.5KJPNv2) and tool (PLINK) is also provided. Full figures and statistics for all analyses can be found in Supplementary Figs. S2A–B and Tables S2A–F

Additionally, we examined the genomic inbreeding coefficient (FROH) to assess the validity of our results, using the same method as previously described [13]. The result for array-based sites (FROH for ROH > 1.5 Mb of 0.010064) was consistent with that reported for the Japanese population in the previous study by Clark et al. (Table S1G). Similarly, in the stratification analysis, we confirmed a marked reduction in the number of very short ROH segments (200–500 Kb) at array-based sites compared with all variant sites for both tools (Fig. S4A–H).

Next, we identified the common sites where Mendelian-inconsistent calls occurred in the BirThree dataset and then analyzed a subset of the 3.5KJPNv2 dataset with these sites excluded. We also switched the genetic map to the 1000 Genomes Project (1KGP) map in the BCFtools analysis. However, neither of these two analyses produced significant differences (Table S1A–D).

Discussion

ROHs are distributed over a wide range of genomic regions in many species. Modern humans have substantially lower genetic diversity, i.e., an estimated effective population size of only about 10,000, compared with other species, and are thus expected to have many ROHs. In this study, we demonstrated that WGS data, with their higher SNP density across the genome, can call many more shorter ROH segments than array-specific sites. In addition, using the BirThree dataset, we observed that leveraging pedigree information can mitigate the effects of sequencing errors, especially on longer ROH segments and regardless of SNP-density effects, and that it is also associated with strong functional enrichment.

Initially, we performed ROH segment identification in the 3.5KJPNv2 dataset using PLINK and BCFtools. Our results demonstrate that a greater number of shorter ROH segments can be detected at all variant sites than at array-specific sites only, suggesting that WGS technology can enhance the detection of ROHs in many unexplored regions that cannot be identified with array-based technologies. Previous studies indicated that the density of genetic variants and the quality of genotype calling may affect the detection of true ROHs [12, 25]. Array genotyping has been used extensively in ROH studies [2, 5]. However, it should be applied with caution when predicting ROH segments, as SNP arrays yield sparse data consisting of a few million autosomal nucleotide positions at most.

We then raised the minimal ROH length to 1.5 Mb to examine whether a similar effect holds for longer ROHs. In contrast to the shorter ROHs, using BCFtools we found that fewer ROH1500 were detected at all variant sites of the genome than at array-specific sites. As an additional benchmark, we used PLINK, adjusting its parameters, to evaluate its concordance with the NROH and SROH results obtained with BCFtools. We had to allow four heterozygous calls per window across all variant sites to closely match the results from the analysis of array-specific sites only. These results are broadly consistent with the findings of the previous study, which recommended setting the parameter to allow more than three heterozygous calls per window to achieve equivalent results between low-coverage WGS data and array data for a Japanese population [3]. Because previous studies mostly used array data with PLINK, PLINK’s results at array sites can be regarded as the standard reference when detecting longer ROHs. This also allows us to compare our findings with previous studies [3, 12, 13, 24]. The ROH1500 detected at array-specific sites by PLINK and BCFtools were comparable in our study, further supporting that BCFtools can achieve similar accuracy. However, sequencing errors in the 3.5KJPNv2 dataset are expected to disrupt long ROHs, thereby reducing the number of ROH1500 detected by BCFtools at all variant sites. Moreover, the presence of such errors is corroborated by the observation that PLINK yields results comparable to those of BCFtools when three heterozygous calls per window are allowed.

Although the efficiency and effectiveness of WGS technologies are remarkable, applying them to unrelated individuals may result in relatively high error rates [26]. For that reason, we analyzed another WGS dataset derived from the TMM BirThree Cohort Study, which includes family-tree information encompassing fathers, mothers, grandparents, and children [21, 23]. Using BCFtools on the BirThree dataset, more ROH1500 were identified at all variant sites than at array-based sites only. Likewise, the number of heterozygous calls allowed per window in the PLINK analysis could be reduced from three to two, indicating better tolerance to sequencing errors. In addition, fewer ROH100 were detected in the BirThree dataset, and we assume this is because shorter ROHs may be merged into longer ones. This superior performance in detecting ROHs may arise from the pre-processing quality control step in which sites with higher Mendelian error rates were removed based on pedigree information, which may have reduced sequencing errors that deviate from Mendelian inheritance. When we examined the distribution of ROH by size (length intervals) in both datasets, we found that the BirThree dataset reduces the differences between array-based and all variant sites, particularly for longer ROHs, which further supports the effectiveness of the BirThree dataset. However, to draw correct conclusions, the differences in SNP marker density between the 3.5KJPNv2 and BirThree datasets need to be considered carefully.

To isolate the effects of sequencing errors from those of SNP marker density, we conducted additional analyses in which SNP density was equalized between the datasets, so that any remaining differences in the ROH analysis could be attributed to sequencing errors rather than to variation in SNP density. There was no significant difference in the number of ROH1500 per individual between 3.5KJPNv2 and its pruned version, suggesting that SNP marker density does not substantially affect the detection of longer ROHs. Interestingly, the results for BirThree consistently showed an increased number of ROH1500 compared with both the unfiltered and the pruned 3.5KJPNv2 datasets. This indicates that, despite its lower SNP density, the BirThree dataset may be less affected by sequencing errors that cannot be effectively removed in the 3.5KJPNv2 dataset. The picture was markedly different for shorter ROHs. We observed a significant reduction in the number of ROH100 in the pruned 3.5KJPNv2 dataset compared with its unfiltered version. This points to the possibility that the sparser marker coverage in the pruned dataset misses some homozygous regions and leads to fewer identified ROH100 overall, whereas the higher SNP density in the unfiltered dataset is more likely to detect heterozygous sites within stretches of homozygosity, thereby splitting them or creating additional ROH100 segments. The number of ROH100 in the BirThree dataset was even lower than in the pruned 3.5KJPNv2 dataset, suggesting that factors beyond SNP density, possibly sequencing errors, still influence the detection of shorter ROHs in these datasets. However, in our additional analysis restricted to the overlapping samples from both datasets, the differences in SNP density did not have a notable effect on the results at either minimal ROH length, reaffirming that the additional application of pedigree information played a larger role.

Our functional analysis of ROH showed statistically significant pathway enrichments that may indicate regions experiencing selective pressure. Using BCFtools, we identified ROH islands harboring gene families such as OR4, USP17, and TAS2R, which showed statistically notable enrichment of pathways related to protein deubiquitination and sensory perception, especially taste and smell. These functions likely reflect adaptations to environmental conditions. For instance, the ability to find food sources and detect bitter compounds, which are often linked to toxins, can provide critical survival advantages. Previous research indicates genetic variation in TAS2R genes across human populations [27, 28]. Considerable variation in the perception of odorants among populations has also long been established, with some of this variation attributed to genetic changes in olfactory receptor (OR) genes [29, 30]. This diversity is likely the result of natural selection favoring alleles that enhance the detection of bitter tastes or specific odors, which may vary with ancestry. In contrast, the ROH islands detected by PLINK were weakly associated with different processes, such as fatty acid metabolism and inositol phosphate signaling pathways. It is plausible that certain homozygous stretches of the genome have been favored and conserved as ROH islands in the Japanese population through natural selection in response to environmental stimuli. Moreover, further exploration of shorter ROHs holds promise for better understanding the roles of ROH islands in diseases and traits, while also providing insights into population-specific genetic architectures and evolutionary events.

In the past, SNP array technology was the standard method for ROH detection, but its limited marker density led to failures in detecting shorter ROH segments. In recent years, the advent of NGS platforms has allowed us to access large proportions of the genome in detail, but WGS technologies can generate higher error rates than array-based ones. Notably, the error rates vary among populations, and the Japanese population in particular exhibits a high rate of mistakenly called heterozygotes, about 13,000 per genome, or approximately 4.5 SNPs per 1 Mb [3]. To reduce the impact of these errors, a fruitful way to explore the potential functionality of ROH segments would be to use WGS data together with pedigree information in biobank-level cohorts. There are, however, several restrictions on using such sequencing technologies in most research, and these data are yet to be used to their full potential in ROH studies. In the TMM Project, high-coverage WGS (>30x) data are consistently collected on a large scale for expanded cohorts of both related and unrelated Japanese individuals, including tens of thousands of participants, which can allow for higher-resolution detection of ROHs. Our future work will focus on ROH-based homozygosity mapping in collaboration with global ROH research communities, taking advantage of such high-coverage WGS data. We concur with the previous discussion [19] that understanding where in the genome ROH affects diseases and traits, and connecting these insights with significant loci previously identified in genome-wide association studies, is a promising direction for further research.

To facilitate comparison with array data in our study, we trimmed the WGS datasets to OmniExpressExome array-specific sites, which may introduce some variation in precision. Despite adopting this alternative trimming method for our WGS data, we observed no significant differences in FROH between our results and the previous study in the Japanese population [13], which supports the robustness of our findings. As a limitation, our study did not investigate the possibility of uniparental isodisomy and hemizygous deletion, which could lead to extended homozygosity, and such potential cytogenetic abnormalities should be taken into consideration in further studies.

In conclusion, our study demonstrates that, by including unrelated individuals and family pedigree information, high-coverage genome sequencing enables the detection of shorter ROHs that are undetectable by genotyping arrays. Furthermore, while longer ROHs may be prone to sequencing errors, the integration of pedigree information can mitigate these inaccuracies. We also conducted a comparative analysis of the fine-scale detection of ROH segments, particularly in hotspot regions such as ROH islands, using the two most representative tools in ROH research (BCFtools and PLINK). Our findings indicated that a WGS dataset incorporating pedigree data can exhibit significantly greater functional pathway enrichment. Additionally, we suggest that future improvements to BCFtools should focus on integrating sequencing errors into the HMM training process to further enhance the accuracy of ROH analyses.

Materials and methods

Whole-genome sequence data

We used two high-coverage (~30x) WGS datasets in the study: 3.5KJPNv2, which consists of data from a total of 3552 individuals participating in multiple Japanese cohorts, including the Tohoku Medical Megabank (TMM) Project cohorts [20], and BirThree, which was generated from 192 families (62 trios/106 septets/24 octets) consisting of a total of 1120 participants in the TMM Birth and Three-Generation (BirThree) Cohort [21]. Both datasets included 208 overlapping individuals. The TMM BirThree Cohort has the advantage of providing precise information on allele inheritance based on pedigree information available in the multigenerational study design, which made it possible for initial detection and removal of potential genotype error sites in the dataset.

To investigate the effect of marker density on ROH detection, we also generated trimmed datasets, in which genotypes are available only for SNP-array sites, of the two WGS datasets. We compared the numbers and the lengths of detected ROHs in the trimmed datasets with those in the initial datasets including all variant sites. We defined SNP sites on the Infinium OmniExpressExome-8 BeadChip (Illumina, San Diego, CA, USA) as “SNP array-based sites”.

ROH detection

Previous studies have predicted that long ROH segments are commonly found in the centromere regions of different species, which could possibly be related to selective sweeps and meiotic drive in these regions [5, 31, 32]. These studies indicated the absence of SNPs in centromeric regions, and so we excluded these regions from our analyses to avoid overestimating ROH. To achieve this, we divided each chromosome into short and long arms.

Initially, variants were filtered with BCFtools to retain only biallelic SNPs (-m2 -M2 -v snps) with allele frequencies between 0.05 and 0.95 (-q 0.05 -Q 0.95) and without missing alleles (-g ^miss). We primarily used BCFtools/RoH [11] to search for ROHs. ROH regions were detected for each chromosome of each sample, and per-region data (-O r option) were collected. Memory usage was set to 20 GB (-b 20480 option). We used a fine-scale genetic map constructed from WGS data on 150 unrelated participants in the TMM BirThree Cohort (https://jmorp.megabank.tohoku.ac.jp/downloads/tommo-genetic_map-20210907) [33] and compared the results with those obtained after switching to the 1KGP genetic map [34]. We set the value of the -G option to 30 to account for genotype (GT) errors.
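
As a rough illustration of how the options named above could be assembled, the following Python sketch builds (but does not run) the two command lines. The option values are taken from the text; the file names, the `-O z` and `-o` output options, and the choice to pass all filters in a single `bcftools view` call are assumptions made for this example, not details reported in the study.

```python
from typing import List


def build_filter_command(in_vcf: str, out_vcf: str) -> List[str]:
    """Argument list for the site-filtering step described in the text."""
    return [
        "bcftools", "view",
        "-m2", "-M2", "-v", "snps",   # biallelic SNPs only
        "-q", "0.05", "-Q", "0.95",   # allele-frequency bounds
        "-g", "^miss",                # keep sites without missing genotypes
        "-O", "z", "-o", out_vcf,     # assumed: compressed VCF output
        in_vcf,
    ]


def build_roh_command(vcf: str, genetic_map: str, out_path: str) -> List[str]:
    """Argument list for the ROH-calling step described in the text."""
    return [
        "bcftools", "roh",
        "-G", "30",           # genotype-only mode with the stated error allowance
        "-O", "r",            # per-region output
        "-b", "20480",        # buffer-size value as stated in the text
        "-m", genetic_map,    # genetic map file (path is a placeholder)
        "-o", out_path,       # assumed: write results to a file
        vcf,
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(build_filter_command("cohort.vcf.gz", "cohort.filtered.vcf.gz")))
    print(" ".join(build_roh_command("cohort.filtered.vcf.gz", "genetic_map.txt", "roh.txt")))
```

Building the argument lists separately from running them makes it easy to inspect the exact options before any job is launched.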

Next, we applied PLINK 1.90 for comparison [7], following the command-line parameters suggested in a previous study [12] with slight modifications for the purposes of our study. To account for sequencing errors that may break a long ROH, one heterozygous call per detection window is generally allowed, but allowing three heterozygous calls per window has been suggested for low-coverage WGS data to obtain results equivalent to SNP-array data [3]. Following this suggestion, we evaluated the PLINK parameter by varying the number of heterozygous genotypes allowed in the homozygous window. The detailed options and parameters used in the analyses are shown in the Additional Information (S1 Text).
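
The effect of tolerating heterozygous calls can be illustrated with a toy example. The sketch below is a simplified greedy run scanner written for this illustration; it is not PLINK's sliding-window algorithm, and the function name, inputs, and thresholds are hypothetical. It only shows why a single erroneous heterozygous call can split one long run into two short ones that both fall below a length threshold.

```python
from typing import List, Tuple


def call_runs(positions: List[int], is_het: List[bool],
              max_het: int, min_length_bp: int) -> List[Tuple[int, int]]:
    """Greedy scan for homozygous runs that tolerate up to max_het heterozygous calls.

    positions: sorted base-pair coordinates of genotyped sites.
    is_het:    True where the genotype at that site is heterozygous.
    Runs start and end at homozygous sites and are kept only if their
    span is at least min_length_bp.
    """
    runs = []
    start = None      # index of the first homozygous site in the current run
    last_hom = None   # index of the most recent homozygous site
    hets = 0          # heterozygous calls seen since the run started
    for i, het in enumerate(is_het):
        if not het:
            if start is None:
                start, hets = i, 0
            last_hom = i
        elif start is not None:
            hets += 1
            if hets > max_het:
                # Too many heterozygous calls: close the run at the last homozygous site.
                if positions[last_hom] - positions[start] >= min_length_bp:
                    runs.append((positions[start], positions[last_hom]))
                start = None
    if start is not None and positions[last_hom] - positions[start] >= min_length_bp:
        runs.append((positions[start], positions[last_hom]))
    return runs


# Five sites spanning 2 Mb with one heterozygous call in the middle.
positions = [0, 500_000, 1_000_000, 1_500_000, 2_000_000]
is_het = [False, False, True, False, False]
strict = call_runs(positions, is_het, max_het=0, min_length_bp=1_500_000)
tolerant = call_runs(positions, is_het, max_het=1, min_length_bp=1_500_000)
```

With no tolerance, neither 0.5 Mb fragment reaches the 1.5 Mb threshold, whereas allowing one heterozygous call recovers the full 2 Mb run.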

For each sample, the number and summed lengths of detected ROHs (NROH and SROH, respectively) were collected according to their minimal length class (100 Kb, 300 Kb, and 1.5 Mb). Detailed descriptive statistics for detected ROH segments with different minimal lengths are shown in Supplementary Tables S1A–F.
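
The per-sample summary described above can be sketched as follows. This is a minimal example under stated assumptions: segment lengths are given in base pairs, the function name is hypothetical, and a segment is counted when its length is greater than or equal to the threshold (whether the study used a strict or non-strict inequality is not stated).

```python
from typing import Dict, List, Tuple

# Minimal length classes named in the text, in base pairs.
MIN_LENGTHS_BP = {"ROH100": 100_000, "ROH300": 300_000, "ROH1500": 1_500_000}


def summarize_roh(segment_lengths_bp: List[int]) -> Dict[str, Tuple[int, int]]:
    """Return {class: (NROH, SROH in bp)} for one individual."""
    summary = {}
    for label, min_len in MIN_LENGTHS_BP.items():
        kept = [length for length in segment_lengths_bp if length >= min_len]
        summary[label] = (len(kept), sum(kept))
    return summary


# One hypothetical individual with three ROH segments.
example = summarize_roh([120_000, 350_000, 2_000_000])
```

Because the classes are defined by a minimal length, each segment counted in ROH1500 is also counted in ROH300 and ROH100.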

To distinguish the effects of sequencing errors from those of SNP marker density, we created a modified version of the 3.5KJPNv2 dataset (referred to as the pruned 3.5KJPNv2 dataset). In each of 10 iterations, variant sites in 3.5KJPNv2 were randomly selected to match the total number of variant sites in BirThree, which comprises fewer variants, and ROH calling was repeated with BCFtools. For each iteration, we calculated the number of ROHs and then averaged these values to define the mean NROH of the pruned 3.5KJPNv2 dataset. Subsequently, the results from the pruned dataset were compared with those from both the unfiltered 3.5KJPNv2 and the BirThree datasets. Additionally, to assess the effects of SNP density using an alternative approach, we performed the analyses exclusively on the 208 individuals shared between both datasets.
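
The pruning procedure can be sketched as random subsampling followed by averaging over iterations. In the sketch below, `count_roh` is a hypothetical callback that stands in for the BCFtools ROH-calling step, and the seed and function names are assumptions introduced for reproducibility of the example.

```python
import random
from statistics import mean
from typing import Callable, List, Sequence


def prune_sites(sites: Sequence[int], target_count: int, rng: random.Random) -> List[int]:
    """Randomly keep target_count sites (without replacement), in coordinate order."""
    return sorted(rng.sample(list(sites), target_count))


def mean_nroh_after_pruning(sites: Sequence[int], target_count: int,
                            count_roh: Callable[[List[int]], int],
                            n_iter: int = 10, seed: int = 0) -> float:
    """Average the ROH count over n_iter independent random prunings."""
    rng = random.Random(seed)
    counts = [count_roh(prune_sites(sites, target_count, rng)) for _ in range(n_iter)]
    return mean(counts)


# Toy usage: 100 sites pruned to 30, with a placeholder "ROH caller"
# that simply returns the number of retained sites.
all_sites = list(range(100))
pruned = prune_sites(all_sites, 30, random.Random(1))
avg = mean_nroh_after_pruning(all_sites, 30, count_roh=len)
```

Averaging over several random subsets reduces the chance that a single unlucky draw of sites drives the comparison.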

We also identified ROH islands as the regions within the top 0.1% or 0.5% of loci most frequently overlapped by ROH segments across individuals, using BEDtools [35]. Subsequently, we annotated the genes that overlapped with ROH islands based on GENCODE v46lift37 [36]. We then used g:Profiler [37], which integrates various databases, including Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), and Reactome (REAC), to identify whether molecular functions and biological pathways were over-represented in our target gene lists.
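
The idea of selecting the most frequently shared loci can be sketched in plain Python. This is not the BEDtools workflow used in the study; fixed-size bins, half-open segment coordinates, and the function names are assumptions made to keep the example self-contained.

```python
import math
from typing import List, Tuple

Segment = Tuple[int, int]  # half-open interval [start, end) in base pairs


def bin_coverage(segments_per_individual: List[List[Segment]],
                 n_bins: int, bin_size: int) -> List[int]:
    """Count, for each bin, how many individuals have an ROH overlapping it."""
    coverage = [0] * n_bins
    for segments in segments_per_individual:
        covered = set()
        for start, end in segments:
            first = start // bin_size
            last = min(n_bins - 1, (end - 1) // bin_size)
            covered.update(range(first, last + 1))
        for b in covered:  # each individual contributes at most once per bin
            coverage[b] += 1
    return coverage


def roh_island_bins(coverage: List[int], top_fraction: float) -> List[int]:
    """Indices of bins whose coverage lies in the top `top_fraction` of all bins.

    Ties at the cutoff are all kept, so slightly more bins than the
    nominal fraction can be returned.
    """
    n_top = max(1, math.ceil(len(coverage) * top_fraction))
    cutoff = sorted(coverage, reverse=True)[n_top - 1]
    return [i for i, c in enumerate(coverage) if c >= cutoff]


# Three hypothetical individuals on a 1,000 bp toy chromosome with 100 bp bins.
cohort = [[(0, 300)], [(200, 400)], [(250, 300)]]
cov = bin_coverage(cohort, n_bins=10, bin_size=100)
islands = roh_island_bins(cov, top_fraction=0.1)
```

In this toy cohort only the bin covering 200–300 bp is shared by all three individuals, so it is the single bin reported as an island.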

The genomic inbreeding coefficient (FROH) was estimated by dividing the cumulative length of all ROH segments in an individual's genome by the total length of the autosomal genome [13]. We also conducted a stratification analysis, grouping ROH segments into different size bins (Fig. S4A–H).
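
The FROH definition above amounts to a single ratio, sketched below. The autosomal length is passed in as a parameter because the exact value used in the study is not stated here; the numbers in the usage lines are made up for illustration.

```python
from typing import Iterable


def f_roh(roh_lengths_bp: Iterable[int], autosome_length_bp: int) -> float:
    """FROH = summed ROH length / total autosomal length, for one individual."""
    if autosome_length_bp <= 0:
        raise ValueError("autosome_length_bp must be positive")
    return sum(roh_lengths_bp) / autosome_length_bp


# Illustrative values only: 4 Mb of ROH on a 400 Mb toy genome.
toy_froh = f_roh([1_500_000, 2_500_000], 400_000_000)
```

Restricting the input to segments above a given minimal length (for example 1.5 Mb) gives the length-specific variant of the coefficient.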

Additionally, we identified common sites at which Mendelian-inconsistent calls frequently occurred in the three-generation BirThree dataset using PLINK’s mendel option (--mendel) and removed those sites from the 3.5KJPNv2 dataset. We then compared the results in the 3.5KJPNv2 dataset with and without the Mendelian-inconsistent error sites.
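
The underlying consistency check can be sketched for a single autosomal trio. This is a stand-in written for illustration, not PLINK's implementation; the genotype encoding, the site labels, and the `min_errors` cutoff for "frequently occurring" errors are assumptions, since the threshold used in the study is not given.

```python
from itertools import product
from typing import Dict, List, Tuple

Genotype = Tuple[int, int]          # two alleles, e.g. (0, 1) for a heterozygote
Trio = Tuple[Genotype, Genotype, Genotype]  # (child, father, mother)


def is_mendelian_consistent(child: Genotype, father: Genotype, mother: Genotype) -> bool:
    """True if the child could carry one allele from each parent (autosomal site)."""
    target = tuple(sorted(child))
    return any(tuple(sorted((f, m))) == target for f, m in product(father, mother))


def frequent_error_sites(trios_by_site: Dict[str, List[Trio]], min_errors: int) -> List[str]:
    """Sites at which at least min_errors trios are Mendelian-inconsistent."""
    flagged = []
    for site, trios in trios_by_site.items():
        errors = sum(not is_mendelian_consistent(c, f, m) for c, f, m in trios)
        if errors >= min_errors:
            flagged.append(site)
    return flagged


# Hypothetical genotypes at two sites.
toy_sites = {
    "chr1:100": [((1, 1), (0, 0), (0, 1)), ((1, 1), (0, 0), (0, 0))],
    "chr1:200": [((0, 1), (0, 1), (0, 0))],
}
to_remove = frequent_error_sites(toy_sites, min_errors=2)
```

Sites flagged in this way would then be excluded before ROH calling, which is the role the pedigree-based filter plays in the analysis.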

The study was approved by the Institutional Review Board of the Tohoku Medical Megabank Organization (initially approved with the approval number 2013-4-103, and last updated with the approval number 2023-4-075). The study was conducted in accordance with the Declaration of Helsinki.