Introduction

Despite the development of human reference genomes1,2, population-specific reference genomes are crucial for accurately capturing the genetic diversity and unique variations within distinct human populations. Several population-specific reference genomes have been constructed and have revealed novel genetic diversity in a population-specific manner3,4,5,6,7,8. To address this need, we previously developed JG1, a population-specific reference genome for the Japanese population8. JG1 incorporates the major alleles of Japanese individuals as reference alleles for variants, including single-nucleotide variants (SNVs), short insertions and deletions (indels) and structural variants. This was achieved by integrating de novo assemblies from three Japanese individuals, using meta-assembly strategies and majority decision among multiple genome assemblies to reduce the impact of rare or private variants. In addition, by anchoring scaffolds with marker information from genetic and radiation hybrid maps, JG1 was constructed independently of other human reference genomes, marking a substantial milestone in creating a de novo Japanese reference genome. JG1 has also been used in next-generation sequencing (NGS) analyses to identify the causal variants of rare diseases in several studies9,10,11,12.

Despite its successes, JG1 has limitations, such as incomplete sequences, gaps, unlocalized fragments, limited original annotations and an incomplete representation of major variations within the Japanese population. Addressing these limitations could substantially enhance the quality of the human reference genome, especially benefiting genome research involving Japanese or Asian populations. Thus, we developed JG2, an updated version of JG1, to overcome these challenges.

The construction of JG2, like that of JG1, was independent of the other human reference genomes. We utilized phased assembly, using Falcon, Falcon-unzip13 and Falcon-Phase algorithms with Hi-C reads14 on Pacific Biosciences (PacBio) RS II continuous long reads (CLR) from the three individuals. Hi-C reads enabled the integration of long-range chromatin interaction data, essential for resolving complex genomic regions and producing a more precise and continuous genome assembly15. This method is particularly beneficial for phased assembly, as it aids in differentiating between maternal and paternal haplotypes. This approach therefore resulted in two haploid assemblies per individual. The scaffolding process of fully phased contigs was also conducted by incorporating information from Hi-C reads, yielding six PacBio/Hi-C phased scaffolds. These six haploid scaffolds were then subjected to a meta-assembly process, leveraging the concept that two haploid genomes represent a random sample from a population, enhancing the representativeness of genetic variations. Building upon the techniques used in the construction of JG1, a meta-assembly was utilized to reconcile multiple assemblies. Scaffolds were anchored using marker information from genetic and radiation hybrid maps, ensuring that JG2’s reconstruction remained independent of other reference genomes. A majority decision of alleles was also conducted based on six haploid scaffolds. This process reduced the rare variants in JG2, making it more representative of the general Japanese population. The genome sequences and annotations of JG2 are available on the jMorp website (https://jmorp.megabank.tohoku.ac.jp/).

Materials and methods

Ethics declaration

This study received approval from the Research Ethical Committee of the Tohoku Medical Megabank Organization, Tohoku University.

Selection and analysis of donor individuals

The details of participant selection were described in our previous study8. In brief, three adult male Japanese volunteers were recruited. Japanese ancestry was confirmed, and the individuals self-reported being healthy without any genetic diseases. Principal component analysis, which verified their similarity with the Japanese population, and G-band analysis, which validated normal karyotypes, were shown in the previous study8.

PacBio CLR

The details of PacBio CLR sequencing were described previously8.

Bionano optical genome mapping

The details of Bionano optical genome mapping were also described in the previous study8.

Mate-pair dataset

Genomic DNA extracted from nucleated blood cells was utilized for library construction using a Nextera Mate Pair Library Preparation kit from Illumina, following the gel-free protocol provided by the manufacturer. This protocol yields a broader range of fragment sizes from 2 to 15 kb. Subsequently, the obtained libraries underwent size selection to achieve a range of 300–800 bp (with a peak at 500 bp) using AMPure XP beads from Beckman Coulter. The libraries were then sequenced on a HiSeq 2500 system from Illumina, using a TruSeq Rapid PE Cluster kit and TruSeq Rapid SBS kit to obtain 201-bp paired-end reads. Mate-pair dataset 1 and dataset 2 were sequenced under a sequencing depth of 12–13 and 34–38, respectively.

Short-read paired end

Short-read paired-end sequencing methods are consistent with those described in previous research8. Specifically, genomic DNA extracted from buffy coat samples was fragmented to an average target size of 550 bp. Library construction was then performed using the TruSeq DNA PCR-Free HT sample prep kit (Illumina), followed by sequencing on a HiSeq 2500 system. This utilized the TruSeq Rapid PE Cluster kit and TruSeq Rapid SBS kit to generate 162- or 259-bp paired-end reads.

ONT long reads

Details of the sample preparation and sequencing method of Oxford Nanopore Technologies (ONT) were described previously8.

Hi-C

Hi-C experiments were essentially performed according to a previously published protocol15. In brief, five million cells were cross-linked with 1% formaldehyde and quenched with 0.2 M glycine. Cells were lysed using Hi-C lysis buffer (10 mM Tris–HCl pH 8.0, 10 mM NaCl and 0.2% Igepal CA-630), and the chromatin was digested by either MboI (NEB, R0147) or HindIII-HF (NEB). Both ends of the digested chromatin were filled in and labeled with biotin-14-dATP (Life Technologies) for MboI or biotin-14-dCTP (Life Technologies) for HindIII-HF using Klenow Fragment (NEB) and ligated with T4 DNA Ligase (NEB). The biotin-labeled DNA was treated with Proteinase K, reverse cross-linked and sheared to 300–500 bp using a Covaris S220 Focused ultrasonicator. After size selection using AMPure XP beads (Beckman Coulter), biotin-labeled sheared DNA fragments were enriched using Dynabeads MyOne Streptavidin T1 beads (Life Technologies). The recovered DNA was end-repaired and ligated to Illumina indexed adapters using the NEBNext Ultra DNA Library Prep Kit for Illumina (NEB) and NEBNext Multiplex Oligos for Illumina (Index Primers Set 1; NEB). The adapter-ligated DNA underwent six or eight cycles of PCR amplification, followed by AMPure XP bead purification, and was then used for sequencing.

De novo assembly of PacBio CLR reads

For each individual, we performed a phased assembly using PacBio CLR reads by FALCON, FALCON-unzip (ver. 1.1.2), and Quiver software of the pb-assembly software suite14. Using FALCON for initial assembly, primary contigs and associated contigs were generated. The results underwent full diploid assembly by FALCON-unzip. The outputs represented the updated primary contigs and haplotype-specific contigs as haplotigs. Subsequently, the results underwent genomic consensus calling by Quiver, yielding the polished version of primary contigs and haplotigs.

Phased assembly

To address the problem of haplotype switching, we used phased assembly by FALCON-Phase (ver. 1.1.0)16. Partially phased long-read assemblies, composed of primary contigs and haplotigs obtained from FALCON-unzip, along with genome-wide chromatin interaction datasets from Hi-C data, were used as the inputs for the analysis. In brief, the area where a haplotig intersects a primary contig constitutes a phase block, while sections of the primary contig devoid of associated haplotigs are denoted as collapsed regions. Primary contigs undergo segmentation at the alignment start and end positions of haplotigs. Hi-C read pairs are aligned to these segmented contigs, with only haplotype-specific alignments retained. A phasing algorithm assigns phase blocks to either state 0 or state 1. FALCON-Phase generates two complete pseudohaplotypes representing phases 0 and 1. Because two sets of Hi-C data deriving from MboI and HindIII enzymes were used, four sets of fully PacBio/Hi-C phased contigs were obtained from this step for each individual.
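The segmentation described above can be illustrated with a minimal Python sketch that splits a primary contig into phase blocks and collapsed regions from haplotig alignment coordinates. This is not FALCON-Phase code: the function name, the half-open coordinate convention and the assumption that haplotig alignments do not overlap are hypothetical simplifications introduced here for illustration.

```python
def segment_primary_contig(contig_length, haplotig_alignments):
    """Split a primary contig into phase blocks and collapsed regions.

    haplotig_alignments holds (start, end) half-open intervals on the
    primary contig where a haplotig aligns. The intervals are assumed
    not to overlap (a simplification made for this sketch).
    """
    segments = []
    cursor = 0
    for start, end in sorted(haplotig_alignments):
        if start > cursor:
            # No haplotig covers this stretch: collapsed region.
            segments.append((cursor, start, "collapsed"))
        # A haplotig overlaps the primary contig here: phase block.
        segments.append((start, end, "phase_block"))
        cursor = end
    if cursor < contig_length:
        segments.append((cursor, contig_length, "collapsed"))
    return segments


example_segments = segment_primary_contig(1000, [(600, 800), (100, 300)])
```

In this toy case the contig is cut at positions 100, 300, 600 and 800, giving two phase blocks separated and flanked by collapsed regions.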

Scaffolding by SALSA2

SALSA2 (ref. 15) (version 2.2), a scaffolding tool utilizing genomic proximity information from Hi-C datasets, was used for the scaffolding process of fully phased contigs. At this step, each set of fully phased contigs was scaffolded with the Hi-C dataset derived from the restriction enzyme that had not been used in the preceding phasing step. In other words, the MboI Hi-C dataset was used for scaffolding the HindIII-based phased assemblies, and conversely, the HindIII Hi-C dataset was used for scaffolding the MboI-based phased assemblies. Finally, four sets of PacBio/Hi-C phased scaffolds were obtained for each individual.

De novo assembly of Bionano optical genome maps

We obtained two sets of Bionano optical genome maps using two enzymes, Nt.BspQI and Nb.BssSI, for subject jg1a, and one set of Bionano optical genome maps was obtained with DLE-1 for jg1b and jg1c. In both cases, the Bionano optical genome maps were assembled in two steps—a rough assembly step and a full assembly step—to perform de novo assembly as independently as possible from the reference. The BionanoSolve software suite (ver. 3.2.1, ver. 3.5) was used for computation.

De novo assembly of nanopore reads

De novo assembly of ONT nanopore reads was conducted using Shasta (0.3.0), Racon (GitHub commit tag 6ca733a) and Medaka (ver. 0.11.1) software.

Polishing with Pilon

Two sets of Illumina paired-end short reads, 162 bp and 259 bp, were aligned to the fully phased contigs and hybrid scaffolds utilizing BWA MEM software17 (version 0.7.17). The resulting alignment files were sorted by coordinates and compressed using the Picard tools – SortSam command (version 2.20.5). Subsequently, the BAM files for the 162- and 259-bp paired-end reads were merged using the Picard tools MergeSamFiles command. These merged BAM files were then split into individual scaffolds using the SAMtools18 (version 1.9) view command, after which each contig/scaffold underwent polishing using Pilon software19 (version 1.23). Finally, the polished contig/scaffolds were merged into a single multi-FASTA format file.
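The order of the polishing steps can be sketched in Python as a function that only builds shell command strings and does not run them. The options shown are the basic documented forms of each tool, not the settings used in this study, which are not stated here. All file names are hypothetical, and the BWA index of the reference and the two samtools index calls are assumed prerequisites that the text does not mention.

```python
def polishing_commands(reference, reads_162, reads_259, scaffold,
                       picard_jar="picard.jar", pilon_jar="pilon.jar"):
    """Return shell command strings for polishing one scaffold.

    Nothing is executed. File names and jar paths are placeholders.
    """
    commands = []
    sorted_bams = []
    for label, (read1, read2) in (("162", reads_162), ("259", reads_259)):
        sam = f"pe{label}.sam"
        bam = f"pe{label}.sorted.bam"
        # Align each paired-end library to the assembly.
        commands.append(f"bwa mem {reference} {read1} {read2} > {sam}")
        # Sort by coordinate and write compressed BAM.
        commands.append(
            f"java -jar {picard_jar} SortSam I={sam} O={bam} SORT_ORDER=coordinate"
        )
        sorted_bams.append(bam)
    # Merge the two libraries into one BAM file.
    inputs = " ".join(f"I={bam}" for bam in sorted_bams)
    commands.append(f"java -jar {picard_jar} MergeSamFiles {inputs} O=merged.bam")
    # Region queries need an index (assumed step, not described in the text).
    commands.append("samtools index merged.bam")
    # Extract the alignments belonging to a single scaffold.
    commands.append(f"samtools view -b -o {scaffold}.bam merged.bam {scaffold}")
    commands.append(f"samtools index {scaffold}.bam")
    # Polish that scaffold only.
    commands.append(
        f"java -jar {pilon_jar} --genome {reference} --frags {scaffold}.bam "
        f"--targets {scaffold} --output {scaffold}.pilon"
    )
    return commands


example_commands = polishing_commands(
    "ref.fa", ("a_1.fq", "a_2.fq"), ("b_1.fq", "b_2.fq"), "scaffold_1"
)
```

Splitting by scaffold, as the text describes, lets each scaffold be polished as an independent job before the results are concatenated into one multi-FASTA file.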

Meta-assembly

The Metassembler algorithm20 performs pairwise, progressive alignments to merge multiple assemblies in the order specified by the user. One of the input assemblies will be used as the primary assembly and another as the secondary assembly. Mate-pair sequences dataset 1 and dataset 2 were used for meta-assembly within each individual and between individuals, respectively. The compression–expansion statistic (CE statistic) is calculated in both primary and secondary assemblies based on the mapping of mate-pair sequences in each assembly. The data from the secondary assembly are used to improve the primary assembly, such as correction of insertion/deletion errors, closing gaps and scaffolding sequences, which are based on comparing CE statistics between two assemblies.
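For orientation, the CE statistic is commonly defined in the assembly-validation literature as a Z-score comparing the mean insert size of the mate pairs spanning a position with the library-wide mean. The Python sketch below follows that common definition. Whether Metassembler's implementation matches it in every detail is an assumption, and the function and parameter names are ours.

```python
import math


def ce_statistic(local_insert_sizes, library_mean, library_sd):
    """Z-score of the local mean insert size against the library mean.

    Returns None when no mate pair spans the position. Negative values
    suggest compression (sequence missing from the assembly); positive
    values suggest expansion (extra sequence in the assembly).
    """
    n = len(local_insert_sizes)
    if n == 0:
        return None
    local_mean = sum(local_insert_sizes) / n
    return (local_mean - library_mean) / (library_sd / math.sqrt(n))


# Nine mate pairs that all look 300 bp shorter than a 3000 +/- 300 bp library.
example_ce = ce_statistic([2700] * 9, library_mean=3000, library_sd=300)
```

Here the local mean is 300 bp below the library mean and the standard error is 100 bp, so the statistic is -3, a value that would point to a likely compression at that position.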

For meta-assembly within an individual, four sets of PacBio/Hi-C phased scaffolds from each individual underwent a meta-assembly process using the Metassembler software (ver. 1.5 with the modification described in ref. 8)20. There were 24 possible combinations of meta-assembly of four sets of phased scaffolds (Supplementary Table 5). Among the 24 meta-assemblies, the one with the longest scaffold length and the fewest scaffolds was selected for further hybrid scaffolding with Bionano-assembled genome maps and ONT assembly.

For meta-assembly among the three individuals, the three sets of polished scaffolds were then meta-assembled using Metassembler software20. There were 12 possible combinations to meta-assemble the three sets: (jg1a + (jg1b + jg1c)), (jg1a + (jg1c + jg1b)), ((jg1a + jg1b) + jg1c), ((jg1a + jg1c) + jg1b), (jg1b + (jg1a + jg1c)), (jg1b + (jg1c + jg1a)), ((jg1b + jg1a) + jg1c), ((jg1b + jg1c) + jg1a), (jg1c + (jg1a + jg1b)), (jg1c + (jg1b + jg1a)), ((jg1c + jg1a) + jg1b) and ((jg1c + jg1b) + jg1a), where x + y indicates meta-assembling x and y in this order. For each round of meta-assembly, assemblies were aligned using NUCmer21, filtered with delta-filter, and converted to COORDS format using show-coords. Mate-pair reads were classified with NxTrim22 and mapped using Bowtie2 (ref. 23), followed by processing with mateAn. The alignment and mapping information were integrated using asseMerge, and the final output was converted to FASTA format with meta2fasta. All of the meta-assemblies in this step were used for anchoring to generate pseudomolecules.
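The 12 orderings listed above follow from taking every permutation of the three individuals and applying the two possible groupings to each. A short Python sketch of that enumeration, with a function name chosen here for illustration:

```python
from itertools import permutations


def meta_assembly_orders(samples):
    """List every ordered, grouped way to meta-assemble three assemblies.

    "x + y" means meta-assembling x and y in this order, as in the text.
    """
    orders = []
    for x, y, z in permutations(samples):
        orders.append(f"({x} + ({y} + {z}))")  # merge the last two first
        orders.append(f"(({x} + {y}) + {z})")  # merge the first two first
    return orders


example_orders = meta_assembly_orders(["jg1a", "jg1b", "jg1c"])
```

Three individuals give 3! = 6 permutations, and two groupings per permutation give the 12 combinations.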

Detection of in silico STS marker amplification and anchoring scaffolds to chromosomes

We performed in silico amplification of the sequence-tagged site (STS) markers of three genetic and six radiation hybrid (RH) maps (Genethon24, Marshfield25 and deCODE26 genetic maps; GeneMap-G3 (ref. 27), GeneMap99-GB4 (ref. 27), TNG28, NCBI_RH29, Stanford-G3 (ref. 30) and Whitehead-RH maps31) on the meta-scaffolds by using in-house electronic PCR software, gPCR (version 2.6a). The STS markers were sourced from the UniSTS database (ftp://ftp.ncbi.nih.gov/pub/ProbeDB/legacy_unists/). One combination of meta-scaffolds from jg1a, jg1b and jg1c was selected for each chromosome for downstream analysis. The meta-scaffolds were anchored to the chromosomes by the path command of ALLMAPS (ver. 0.8.12)32.

Mitochondrial genome and unplaced scaffolds

For the mitochondrial genome, we utilized the mitochondrial genome from JG1. Unanchored contigs/scaffolds generated from anchoring processes against the genetic and radiation hybrid maps using ALLMAPS were also collected. Subsequently, we removed contigs/scaffolds that mapped to the previously described set of chromosomes. In addition, contigs/scaffolds shorter than 1 kb were excluded. Next, we conducted an all-by-all alignment of the remaining contigs/scaffolds to obtain a set of unique sequences. Finally, scaffolds with N-gaps exceeding 80% of the sequence were excluded.

Major allele substitution and manual modification

The selected set of meta-assemblies was aligned with two sets of six meta-scaffolds (HindIII and MboI) using minimap2 (version 2.17)33. Variants in each set were identified using the paftools call command. To standardize variant representation, we used the BCFtools norm command (version 1.9). Major allele substitutions were conducted by selecting the allele shared by more than three of six meta-scaffolds in each set. For multiallelic sites with equal allele frequencies, the selection was made randomly. The lengths of consecutive N-gaps in heterochromatic regions were manually modified.
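The majority decision can be sketched in Python as follows. This is a minimal illustration rather than the pipeline used for JG2. The function name is hypothetical, and picking at random among the most frequent alleles whenever no allele exceeds half of the scaffolds is our assumption about how ties are handled.

```python
import random
from collections import Counter


def major_allele(alleles, rng):
    """Pick the allele carried by a strict majority of the scaffolds.

    alleles: one allele per meta-scaffold (six in the text).
    rng: a random.Random instance, passed in for reproducibility.
    If no allele exceeds half of the scaffolds, one of the most
    frequent alleles is chosen at random (assumed tie handling).
    """
    counts = Counter(alleles)
    best = max(counts.values())
    candidates = sorted(a for a, c in counts.items() if c == best)
    if best > len(alleles) // 2:
        # Strict majority, e.g. more than three of six.
        return candidates[0]
    return rng.choice(candidates)


rng = random.Random(0)
clear_case = major_allele(["A", "A", "A", "A", "G", "G"], rng)
tied_case = major_allele(["A", "A", "A", "G", "G", "G"], rng)
```

With four of six scaffolds carrying A, A is selected deterministically. With a three-versus-three split, either allele may be returned.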

JG2 assembly assessment

Consensus quality

JG2 was aligned with the GRCh38 reference genome using the NUCmer command from the MUMmer software suite21. The proportion of covered regions and the average identity between the genomes were calculated using the dnadiff tool, also part of the MUMmer suite34. In addition, assemblies were aligned to the GRCh38 reference genome using minimap2 software, and variants were identified with the paftools call command. The normalized variants were then annotated using the SnpEff software35, referencing the GRCh38.86 database.

Representativeness of JG2 for Japanese variants

To determine whether JG2 harbors the major allele among the Japanese, JG2 was aligned against the reference genome hs37d5 to detect SNVs. Genome-by-genome alignment and comparison were performed using minimap2 and paftools software33. Allele frequency (AF) was investigated on the 3.5KJPNv2 AF panel36,37. AF spectra of JG2 and JG1 were created with the horizontal axis representing the non-hs37d5-type allele and the vertical axis showing the number of such variant sites.
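As an illustration of how such an AF spectrum and threshold counts could be tabulated, the Python sketch below bins a list of non-hs37d5-type allele frequencies and counts the sites at or above given cutoffs. The function names, the bin count and the input values are made up for the example and are not taken from the 3.5KJPNv2 panel.

```python
def af_spectrum(allele_frequencies, n_bins=10):
    """Count variant sites per equal-width AF bin over [0, 1]."""
    counts = [0] * n_bins
    for af in allele_frequencies:
        # AF = 1.0 falls into the last bin rather than past it.
        index = min(int(af * n_bins), n_bins - 1)
        counts[index] += 1
    return counts


def sites_at_or_above(allele_frequencies, thresholds):
    """Number of sites whose AF is at least each threshold."""
    return {t: sum(1 for af in allele_frequencies if af >= t) for t in thresholds}


toy_afs = [1.0, 0.995, 0.95, 0.5, 0.02]  # invented values
toy_spectrum = af_spectrum(toy_afs)
toy_counts = sites_at_or_above(toy_afs, [1.0, 0.99, 0.90])
```

A reference that carries major alleles should concentrate its variant sites in the highest AF bins, which is the pattern the spectrum is meant to expose.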

Variant calling in 104 JPT samples from the 1000 Genomes Project

High-coverage (30×) CRAM files for 104 JPT (Japanese in Tokyo, Japan) individuals were downloaded from the International Genome Sample Resource. These CRAM files were subsequently converted to paired-end FASTQ format. The resulting FASTQ files were aligned to the respective reference genomes (JG2 and GRCh38) using BWA-MEM17. Following alignment, duplicate reads were marked, and subsequently, SNVs and indels were called using Strelka238. Callable regions were defined as 100-bp windows where sequencing depth (DP) was 10 ≤ DP < 100 in ≥90% of the 104 JPT individuals.
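The callable-region rule can be written as a small Python predicate. How depth is summarized within each 100-bp window is not specified above, so this sketch assumes one depth value per individual per window (for example, the window mean). The function and parameter names are hypothetical.

```python
def is_callable_window(depths, min_dp=10, max_dp=100, min_fraction=0.9):
    """Return True if enough individuals have in-range depth in a window.

    depths: one depth value per individual for a single 100-bp window
    (assumed to be a per-window summary such as the mean).
    """
    in_range = sum(1 for dp in depths if min_dp <= dp < max_dp)
    return in_range >= min_fraction * len(depths)


# 104 individuals: 94 with adequate depth, 10 with low depth.
example_window = is_callable_window([30] * 94 + [5] * 10)
```

With 104 individuals, the 90% threshold corresponds to at least 94 individuals with 10 ≤ DP < 100.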

Results

JG2 construction

The individuals selected for constructing JG2 were three Japanese male volunteers named jg1a, jg1b and jg1c, the same individuals used to create the previous Japanese reference genome, JG1 (ref. 8). For each individual, we obtained over 120× PacBio CLR, two sets of over 49× Hi-C reads (using MboI and HindIII enzymes), one or two sets of over 120× Bionano optical genome maps (using DLE-1 or BspQI and BssSI enzymes), one set of ONT long reads, two sets of mate-pair reads and two sets of paired-end reads (Illumina HiSeq 162-bp and 259-bp reads). The depth of reads or optical genome maps is summarized in Supplementary Table 1.

Genome assembly for each individual

JG2 was created by first performing a phased assembly for each individual and then integrating genomes among the three Japanese individuals (Fig. 1). We performed a phased assembly for each individual using PacBio CLR reads by Falcon, Falcon-unzip and Quiver software of the pb-assembly software suite (Fig. 1A). The assembly statistics, including total length, N50 and the number of contigs, are presented in Supplementary Table 2. After polishing, the numbers of primary contigs of jg1a, jg1b and jg1c were 1439, 1386 and 1271, respectively. The N50 for each individual was around 20 Mb.

Fig. 1: Construction strategy of the JG2 reference genome.

Three Japanese male volunteers (jg1a, jg1b and jg1c) provided the data for JG2. High-coverage sequencing and optical genome mapping data were generated: PacBio CLRs, Hi-C reads, Bionano optical genome maps, ONT nanopore long reads, Illumina mate-pair reads and Illumina paired-end reads. A Phased genome assemblies and scaffolding were performed for each individual using PacBio CLRs and Hi-C reads. Meta-assemblies were created within each individual. B Bionano optical genome maps were assembled, and PacBio/Hi-C-based assemblies were scaffolded with the optical genome maps in each individual. In addition, ONT long reads were assembled within each individual, then meta-assembled with the PacBio/Hi-C/Bionano-based assemblies, resulting in a draft assembly for each individual. C Draft assemblies for the three individuals were meta-assembled. White boxes indicate raw data, gray boxes indicate intermediate assemblies and rounded black boxes indicate software or operations.

Subsequently, by using two sets of Hi-C data separately, we conducted phased assembly, which generated two sets of haploid genome assemblies per individual using Falcon-Phase software: MboI-based phased assemblies and HindIII-based phased assemblies (Fig. 1A). The assembly statistics are presented in Supplementary Table 3. These assemblies were then polished using 162-bp and 259-bp HiSeq short reads by Pilon software. Scaffolding was conducted on each set of polished haploid genome assemblies of the PacBio/Hi-C phased contigs utilizing SALSA2 software. Specifically, we utilized the Hi-C dataset that was not employed for phasing. For example, we used the MboI Hi-C dataset for scaffolding the HindIII-based phased assemblies. Conversely, we used the HindIII Hi-C dataset for scaffolding the MboI-based phased assemblies (Fig. 1A). Consequently, we obtained four sets of PacBio/Hi-C phased scaffolds for each individual. The assembly statistics are presented in Supplementary Table 3. The number of scaffolds of four sets of phased scaffolds in jg1a, jg1b and jg1c was 997 ± 55 (mean ± s.d.).

In addition to PacBio/Hi-C phased scaffolds, de novo assembly of Bionano optical genome maps was performed by BionanoSolve software, and de novo assembly of ONT nanopore reads was conducted using Shasta, Racon and Medaka software (Fig. 1B). Subsequently, the ONT assemblies were refined by polishing with 162-bp and 259-bp HiSeq short reads using Pilon software. The summary of assembly statistics is presented in Supplementary Table 4.

In summary, we acquired four sets of phased PacBio/Hi-C assemblies from each participant, along with one or two sets of Bionano-assembled genome maps and one set of ONT assemblies. These datasets were subsequently used in further meta-assembly.

Meta-assembly within each individual

The meta-assembly process involves integrating the locally best sequences from all input assemblies across the genome. These sequences are merged to create a final sequence that is either as good as or superior to the individual constituent assemblies. For each individual, four sets of phased PacBio/Hi-C assemblies underwent two-step meta-assemblies, resulting in 24 different meta-assemblies (Fig. 1A). Then, the meta-assembly with the fewest mis-assemblies—determined by comparison with genetic/RH maps and JG1—was selected for further hybrid scaffolding using Bionano-assembled genome maps and ONT assembly. Mate-pair short reads were used for meta-assembly by Metassembler software. The meta-assembly statistics for two-step meta-assemblies of phased PacBio/Hi-C assemblies are presented in Supplementary Table 5.

Following that, we conducted hybrid scaffolding of the meta-assembly using Bionano genome maps through BionanoSolve software. Subsequently, the resulting hybrid scaffolds were merged with the ONT assembly utilizing merged dataset 1 mate-pair short reads via Metassembler software. In this step of meta-assemblies within each individual, one meta-assembly per individual was generated. Consequently, we obtained three sets of meta-assemblies for the three individuals.

Meta-assemblies among individuals

Meta-assembly was conducted among the three individuals using mate-pair short reads from dataset 2 through Metassembler software, resulting in 12 sets of meta-assemblies (Fig. 1C). Subsequently, 12 sets of meta-assemblies were anchored to 3 genetic and 6 radiation hybrid maps using ALLMAPS software, resulting in 12 sets of pseudomolecules.

Selection of JG2

By comparing the pseudomolecules anchored to each linkage group with the corresponding chromosome sequence of JG2, one pseudomolecule was chosen for each chromosome sequence for JG2 (Supplementary Table 6). The selected pseudomolecules, mitochondrial reference genome and unplaced scaffolds comprise a full set of the reference genome. Major allele substitution was performed using variants from two sets of six meta-scaffolds (HindIII and MboI) to remove rare individual variants. Manual modifications were conducted to adjust the N-gap length for the acrocentric, centromeric, heterochromatic and telomeric regions, yielding the final set of genome sequences of JG2.

As all donors were male, the Y chromosome was also assembled. The final length of the Y chromosome was 50,152,051 bp, including 36,355,468 bp of N-gaps. Most of these gaps (35,226,671 bp) were intentionally inserted to represent centromeric and telomeric regions and to mask the pseudoautosomal regions for improved utility in NGS analysis. Excluding these intentional gaps, the resolved sequence was approximately 14 Mb, with remaining undetermined regions constituting less than 8% of this length, indicating a high-quality assembly.
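The gap arithmetic for the Y chromosome can be checked with a few lines of Python. The reading used here is one interpretation of the figures above: the intentional gaps are subtracted from the total length, and the remaining N-gaps are expressed as a fraction of what is left. The variable names are ours.

```python
total_length = 50_152_051      # final Y chromosome length (bp)
all_gaps = 36_355_468          # all N-gaps (bp)
intentional_gaps = 35_226_671  # gaps inserted on purpose (bp)

# Length once the intentionally inserted gaps are set aside.
length_without_intentional = total_length - intentional_gaps
# N-gaps that reflect genuinely undetermined sequence.
remaining_gaps = all_gaps - intentional_gaps
# Share of the remaining length that is still undetermined.
undetermined_fraction = remaining_gaps / length_without_intentional
```

Under this reading, 14,925,380 bp remain after removing the intentional gaps, of which 1,128,797 bp (about 7.6%) are undetermined, in line with the statement that these regions make up less than 8%.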

Evaluation of JG2

The procedure described above resulted in a set of chromosome-level sequences for 22 autosomes, 2 sex chromosomes, 1 mitochondrial chromosome and 1148 unplaced scaffolds collectively designated as JG2. The total length of JG2 was approximately 3.1 Gb, which included 609 gap regions totaling 251 Mb. Of these gaps, 233.63 Mb were intentionally inserted to represent telomeric, centromeric and heterochromatic regions. The scaffold N50 was 152,668,378 bp. The number of misassemblies identified in JG2 was 581 when compared against GRCh38, and 503 against T2T-CHM13v2.0 (ref. 39). Dot plot alignments of JG2 against these reference genomes are shown in Fig. 2. JG2 covered 91.11% of the GRCh38 with an average identity of 99.80% and covered 90.82% of the T2T-CHM13v2.0 with an average identity of 99.81%. Comparative analyses of JG2 against hs37d5, AK1, KOREF1, HX1 and the previously constructed JG1 are shown in Supplementary Fig. 1 and Supplementary Table 7.
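For readers less familiar with the metric, N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. A minimal Python sketch of the standard calculation follows. It is not the tool used to produce the figures above, and the toy lengths are invented.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (0 if empty)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop once the longest sequences cover at least half the total.
        if running * 2 >= total:
            return length
    return 0


toy_n50 = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])
```

In the toy example the total is 54, and the three longest sequences (10, 9 and 8) reach 27, so the N50 is 8.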

Fig. 2: Dot plots show co-linearity between JG2 and other reference genomes.

A JG2 versus GRCh38. B JG2 versus T2T-CHM13v2.0.

Compared with the GRCh38 reference genome, JG2 contained 3,115,695 variants, of which 2130 (0.054%) were classified as high-impact variants. Among the protein-truncating variants, the numbers of stop-gained, frameshift, splice-acceptor and splice-donor variants were 55, 280, 141 and 118, respectively.

Representativeness of JG2 for SNVs composition of the Japanese population

The genome-by-genome alignment and comparison revealed 2,321,710 SNVs between hs37d5 and JG2 in the autosomes and X chromosome. Of these SNVs, 298,644 had an AF of 1.0 in the 3.5KJPNv2 AF panel, indicating that all Japanese individuals in the panel carry JG2-type alleles at these 298,644 sites. This number is larger than that of JG1 (246,464), demonstrating that JG2 better represents major alleles in the Japanese population. In addition, we identified 366,364 SNV sites with an AF ≥0.99 and 624,847 SNV sites with an AF ≥0.90 in the 3.5KJPNv2 AF panel. AF spectra for SNVs comparing JG2 and JG1 are shown in Fig. 3.

Fig. 3: The AF spectrum of JG2 and JG1.

JG2 was aligned to the reference genome hs37d5 to detect SNVs. Allele frequencies in the Japanese population were analyzed using the 3.5KJPNv2 AF panel. The spectrum displays the alternative AF for SNVs on the horizontal axis and the number of variant sites on the vertical axis.

Variant calling with JG2 for 104 JPT individuals from 1KGP

To evaluate the utility of JG2 as a population-specific reference genome for the Japanese population, variant calling was performed using high-coverage WGS data from 104 JPT (Japanese in Tokyo) individuals from the 1000 Genomes Project aligned to both JG2 and GRCh38. The number of SNVs and indels called was consistently lower when using JG2. Specifically, the number of SNVs was 2,920,870 ± 17,174 for JG2, compared with 3,772,565 ± 22,010 for GRCh38 (Fig. 4A). Similarly, the number of indels was 771,343 ± 6058 for JG2, compared with 862,144 ± 6299 for GRCh38 (Fig. 4B). While the callable regions were slightly lower for some chromosomes in JG2 (Supplementary Table 8), this minor difference is substantially offset by a disproportionately larger reduction in detected variants (approximately 851,700 fewer SNVs and 90,800 fewer indels on average per individual). This highlights JG2’s improved accuracy for variant analysis in Japanese individuals by better reflecting population-specific sequences and thus reducing false-positive calls attributable to divergence from the reference.

Fig. 4: Comparison of variant calls and callable regions between JG2 and GRCh38 reference genomes.

A The number of SNVs per individual called from 104 JPT (Japanese in Tokyo) samples aligned to either JG2 or GRCh38 reference genome. B The number of indels per individual from the same dataset.

Discussion

The current study demonstrated a strategy to assemble the population-specific reference genome, JG2, from three individuals whose genome data were used to construct the previous version of the Japanese-specific reference genome, JG1. By applying advanced strategies, JG2 better covers Japanese-specific variants and enhances assembly statistics. Using Hi-C data for phase-aware assembly, the process generated two haploid assemblies per individual. Unlike JG1, this method treats the haploid genomes as random samples from a population, increasing genetic variation representation. The increase in haplotypes enhanced population representativeness and, through majority decision, reduced the presence of rare personal variants, making JG2 more generalizable to the Japanese population.

Hi-C sequencing technology was instrumental in our phased assembly process. Hi-C data provide long-range contact information that helps to accurately place contigs and scaffolds into their correct positions along the chromosomes, enabling the differentiation of maternal and paternal haplotypes40,41. This phasing is crucial for understanding the allelic variations and structural differences between the two sets of chromosomes, providing a more complete and accurate representation of the genome.

The meta-assembly strategy, which involved combining the best sequences from multiple assemblies, enhanced assembly quality for the Japanese genome8. For each individual, the meta-assembly process led to enhanced scaffold lengths, indicated by the increased N50 in JG2 compared with JG1 (152,668,378 bp versus 141,953,703 bp). In addition, a majority decision strategy was applied to remove rare variants. By comparing multiple assemblies and subsequently selecting the consensus sequence, we effectively mitigated the inclusion of sequencing errors and rare personal variants that might not represent the population. This approach reduced the number of misassemblies in JG2 compared with JG1 (1055 versus 1654) and ensured that the final assemblies were more accurate and reflective of the common genomic features within the Japanese population. The iterative meta-assembly among individuals further consolidated these improvements, creating comprehensive and representative pseudomolecules for each chromosome. Moreover, variant calling using 104 Japanese individuals with JG2 as the reference resulted in consistently fewer SNVs and indels compared with using GRCh38. This indicates that JG2 provides a better match to the Japanese allele spectrum and improves the accuracy of variant identification for this population.

Despite its advancements, JG2 is based on technology slightly behind the state of the art compared with the recent developments presented by T2T39 and pangenome projects42,43. T2T utilized high-accuracy long-read sequences such as HiFi reads to construct a high-resolution assembly string graph to address repetitive and complex regions such as centromeres, telomeres and other difficult-to-sequence areas39. In JG2, meanwhile, these regions are represented by N as uncharacterized bases. A pangenome encompasses the collective whole-genome sequences of multiple individuals to capture the genetic diversity of a species, utilizing high-quality, phased haplotypes and a graph-based data structure for improved reference and variant indexing, with compatibility to existing human reference genomes42,43. However, it is important to note that JG2 has been steadily updated compared with JG1. This continuous improvement ensures that JG2 remains a valuable fundamental information resource that can be used for various analyses, such as identifying population-specific genetic variations or conducting comparative studies within the Japanese population.

In conclusion, this study improved the Japanese reference genome, creating JG2 via phased assembly from three individuals. Hi-C data enabled phase-aware assembly, generating two haploid assemblies per individual and better representing genetic variation. A meta-assembly strategy combined the best sequences, improving scaffold lengths and accuracy, while a majority decision approach minimized rare variants. These methods produced high-quality, comprehensive pseudomolecules, making JG2 more representative of the Japanese population.