JG2: an updated version of the Japanese population-specific reference genome

Sriwichaiin, Sirawit; Makino, Satoshi; Funayama, Takamitsu; Otsuki, Akihito; Kawashima, Junko; Okamura, Yasunobu; Tadaka, Shu; Katsuoka, Fumiki; Kumada, Kazuki; Tsutsumi, Shuichi; Kinoshita, Kengo; Yamamoto, Masayuki; Tamiya, Gen; Takayama, Jun

doi:10.1038/s41439-025-00326-y


Article
Open access
Published: 01 October 2025


Sirawit Sriwichaiin¹,
Satoshi Makino²,
Takamitsu Funayama^1,2,
Akihito Otsuki²,
Junko Kawashima²,
Yasunobu Okamura ORCID: orcid.org/0000-0001-9847-9037^2,3,
Shu Tadaka²,
Fumiki Katsuoka^2,3,
Kazuki Kumada²,
Shuichi Tsutsumi⁴,
Kengo Kinoshita^2,3,5,6,
Masayuki Yamamoto^2,7,
Gen Tamiya^1,2,8 &
Jun Takayama ORCID: orcid.org/0000-0003-2803-6526^1,2,8

Human Genome Variation volume 12, Article number: 21 (2025)


Abstract

Here we present the construction of JG2, an updated population-specific reference genome for the Japanese population. Utilizing data from three individuals previously used in the construction of JG1, several methodologies were employed to enhance genomic coverage and assembly quality. Hi-C sequencing technology facilitated phase-aware assembly, generating two haploid assemblies per individual and enabling improved representation of genetic variation. A meta-assembly strategy and a majority decision approach further refined assembly quality by combining the best sequences from multiple assemblies and minimizing the inclusion of rare variants. The resulting JG2 genome comprises chromosome-level sequences, mitochondrial chromosomes and unplaced scaffolds, offering more comprehensive coverage of the Japanese genome. Comparative analyses with other reference genomes demonstrated the accuracy and representativeness of JG2, highlighting its utility for genetic research involving the Japanese population. Overall, by adopting the phased assembly technique, JG2 represents a substantial advancement over the collapsed assembly-based JG1, with improvements including a greater number of identified variants (3,115,695 variants, of which 298,644 had an allele frequency (AF) of 1.0 in the 3.5KJPNv2 AF panel) and a higher N50 value (152,668,378 bp). These enhancements provide researchers with a more precise and comprehensive resource for understanding the genetic landscape of the Japanese population. The sequences and annotations are available on the jMorp website (https://jmorp.megabank.tohoku.ac.jp/).


Introduction

Despite the development of human reference genomes^1,2, population-specific reference genomes are crucial for accurately capturing the genetic diversity and unique variations within distinct human populations. Several population-specific reference genomes have been constructed and have revealed novel genetic diversity in a population-specific manner^3,4,5,6,7,8. To address this need, we previously developed JG1, a population-specific reference genome for the Japanese population⁸. JG1 incorporates the major alleles of Japanese individuals as reference alleles for variants, including single-nucleotide variants (SNVs), short insertions and deletions (indels) and structural variants. This was achieved by integrating de novo assemblies from three Japanese individuals, using meta-assembly strategies and majority decision among multiple genome assemblies to reduce the impact of rare or private variants. In addition, by anchoring scaffolds with marker information from genetic and radiation hybrid maps, JG1 was constructed independently of other human reference genomes, marking a substantial milestone in creating a de novo Japanese reference genome. JG1 has also been used in next-generation sequencing (NGS) analyses to identify the causal variants of rare diseases in several studies^9,10,11,12.

Despite its successes, JG1 has limitations, such as incomplete sequences, gaps, unlocalized fragments, limited original annotations and an incomplete representation of major variations within the Japanese population. Addressing these limitations could substantially enhance the quality of the human reference genome, especially benefiting genome research involving Japanese or Asian populations. Thus, we developed JG2, an updated version of JG1, to overcome these challenges.

The construction of JG2, like that of JG1, was independent of the other human reference genomes. We utilized phased assembly, using Falcon, Falcon-unzip¹³ and Falcon-Phase algorithms with Hi-C reads¹⁴ on Pacific Biosciences (PacBio) RS II continuous long reads (CLR) from the three individuals. Hi-C reads enabled the integration of long-range chromatin interaction data, essential for resolving complex genomic regions and producing a more precise and continuous genome assembly¹⁵. This method is particularly beneficial for phased assembly, as it aids in differentiating between maternal and paternal haplotypes. This approach therefore resulted in two haploid assemblies per individual. The fully phased contigs were also scaffolded using information from Hi-C reads, yielding six PacBio/Hi-C phased scaffolds. These six haploid scaffolds were then subjected to a meta-assembly process, leveraging the concept that two haploid genomes represent a random sample from a population, enhancing the representativeness of genetic variations. Building upon the techniques used in the construction of JG1, a meta-assembly was utilized to reconcile multiple assemblies. Scaffolds were anchored using marker information from genetic and radiation hybrid maps, ensuring that JG2’s reconstruction remained independent of other reference genomes. A majority decision of alleles was also conducted based on the six haploid scaffolds. This process reduced the rare variants in JG2, making it more representative of the general Japanese population. The genome sequences and annotations of JG2 are available on the jMorp website (https://jmorp.megabank.tohoku.ac.jp/).

Materials and methods

Ethics declaration

This study received approval from the Research Ethical Committee of the Tohoku Medical Megabank Organization, Tohoku University.

Selection and analysis of donor individuals

The details of participant selection were described in our previous study⁸. In brief, three adult male Japanese volunteers were recruited. Japanese ancestry was confirmed, and all individuals self-reported being healthy without any genetic diseases. Principal component analysis, which verified their similarity with the Japanese population, and G-band analysis, which validated normal karyotypes, were shown in the previous study⁸.

PacBio CLR

The details of PacBio CLR sequencing were described previously⁸.

Bionano optical genome mapping

The details of Bionano optical genome mapping were also described in the previous study⁸.

Mate-pair dataset

Genomic DNA extracted from nucleated blood cells was utilized for library construction using a Nextera Mate Pair Library Preparation kit from Illumina, following the gel-free protocol provided by the manufacturer. This protocol yields a broad range of fragment sizes, from 2 to 15 kb. Subsequently, the obtained libraries underwent size selection to achieve a range of 300–800 bp (with a peak at 500 bp) using AMPure XP beads from Beckman Coulter. The libraries were then sequenced on a HiSeq 2500 system from Illumina, using a TruSeq Rapid PE Cluster kit and TruSeq Rapid SBS kit to obtain 201-bp paired-end reads. Mate-pair dataset 1 and dataset 2 were sequenced to depths of 12–13× and 34–38×, respectively.

Short-read paired end

Short-read paired-end sequencing methods are consistent with those described in previous research⁸. Specifically, genomic DNA extracted from buffy coat samples was fragmented to an average target size of 550 bp. Library construction was then performed using the TruSeq DNA PCR-Free HT sample prep kit (Illumina), followed by sequencing on a HiSeq 2500 system. This utilized the TruSeq Rapid PE Cluster kit and TruSeq Rapid SBS kit to generate 162- or 259-bp paired-end reads.

ONT long reads

Details of the sample preparation and sequencing method of Oxford Nanopore Technologies (ONT) were described previously⁸.

Hi-C

Hi-C experiments were essentially performed according to a previously published protocol¹⁵. In brief, five million cells were cross-linked with 1% formaldehyde and quenched with 0.2 M glycine. Cells were lysed using Hi-C lysis buffer (10 mM Tris–HCl pH 8.0, 10 mM NaCl and 0.2% Igepal CA-630), and the chromatin was digested by either MboI (NEB, R0147) or HindIII-HF (NEB). Both ends of the digested chromatin were filled in and labeled with biotin-14-dATP (Life Technologies) for MboI or biotin-14-dCTP (Life Technologies) for HindIII-HF using Klenow Fragment (NEB) and ligated with T4 DNA Ligase (NEB). The biotin-labeled DNA was treated with Proteinase K, reverse cross-linked and sheared to 300–500 bp using a Covaris S220 Focused ultrasonicator. After size selection using AMPure XP beads (Beckman Coulter), biotin-labeled sheared DNA fragments were enriched using Dynabeads MyOne Streptavidin T1 beads (Life Technologies). The recovered DNA was end-repaired and ligated to Illumina indexed adapters using the NEBNext Ultra DNA Library Prep Kit for Illumina (NEB) and NEBNext Multiplex Oligos for Illumina (Index Primers Set 1: NEB). The adapter-ligated DNA underwent six or eight cycles of PCR amplification, followed by AMPure XP bead purification, and was then used for sequencing.

De novo assembly of PacBio CLR reads

For each individual, we performed a phased assembly using PacBio CLR reads by FALCON, FALCON-unzip (ver. 1.1.2), and Quiver software of the pb-assembly software suite¹⁴. Using FALCON for initial assembly, primary contigs and associated contigs were generated. The results underwent full diploid assembly by FALCON-unzip. The outputs represented the updated primary contigs and haplotype-specific contigs as haplotigs. Subsequently, the results underwent genomic consensus calling by Quiver, yielding the polished version of primary contigs and haplotigs.

Phased assembly

To address the problem of haplotype switching, we used phased assembly by FALCON-Phase (ver. 1.1.0)¹⁶. Partially phased long-read assemblies, composed of primary contigs and haplotigs obtained from FALCON-unzip, along with genome-wide chromatin interaction datasets from Hi-C data, were used as the inputs for the analysis. In brief, the area where a haplotig intersects a primary contig constitutes a phase block, while sections of the primary contig devoid of associated haplotigs are denoted as collapsed regions. Primary contigs undergo segmentation at the alignment start and end positions of haplotigs. Hi-C read pairs are aligned to these segmented contigs, with only haplotype-specific alignments retained. A phasing algorithm assigns phase blocks to either state 0 or state 1. FALCON-Phase generates two complete pseudohaplotypes representing phases 0 and 1. Because two sets of Hi-C data deriving from MboI and HindIII enzymes were used, four sets of fully PacBio/Hi-C phased contigs were obtained from this step for each individual.
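The segmentation step described above can be illustrated with a minimal Python sketch. It is not FALCON-Phase code: the function name, the half-open coordinate convention and the assumption that haplotig alignments do not overlap one another are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

# (start, end, kind) with half-open coordinates on the primary contig
Segment = Tuple[int, int, str]


def segment_primary_contig(
    contig_len: int, haplotig_alns: List[Tuple[int, int]]
) -> List[Segment]:
    """Cut a primary contig at haplotig alignment starts and ends.

    Regions covered by a haplotig become phase blocks; the rest are
    collapsed regions. Assumes the alignments do not overlap.
    """
    segments: List[Segment] = []
    cursor = 0
    for start, end in sorted(haplotig_alns):
        if start > cursor:
            segments.append((cursor, start, "collapsed"))
        segments.append((start, end, "phase_block"))
        cursor = end
    if cursor < contig_len:
        segments.append((cursor, contig_len, "collapsed"))
    return segments


example = segment_primary_contig(1000, [(600, 900), (100, 300)])
```

In this toy example a 1,000-bp contig with two haplotigs is cut into five pieces: three collapsed regions and two phase blocks, each of which would later be assigned to state 0 or state 1.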

Scaffolding by SALSA2

SALSA2 (ref. ¹⁵) (version 2.2), a scaffolding tool that uses genomic proximity information from Hi-C datasets, was used to scaffold the fully phased contigs. For each set of fully phased contigs, scaffolding used the Hi-C dataset derived from the restriction enzyme that had not been used in the preceding phasing step. In other words, the MboI Hi-C dataset was used for scaffolding the HindIII-based phased assemblies, and conversely, the HindIII Hi-C dataset was used for scaffolding the MboI-based phased assemblies. Finally, four sets of PacBio/Hi-C phased scaffolds were obtained for each individual.

De novo assembly of Bionano optical genome maps

We obtained two sets of Bionano optical genome maps using two enzymes, Nt.BspQI and Nb.BssSI, for subject jg1a, and one set of Bionano optical genome maps was obtained with DLE-1 for jg1b and jg1c. In both cases, the Bionano optical genome maps were assembled in two steps—a rough assembly step and a full assembly step—to perform de novo assembly as independently as possible from the reference. The BionanoSolve software suite (ver. 3.2.1, ver. 3.5) was used for computation.

De novo assembly of nanopore reads

De novo assembly of ONT nanopore reads was conducted using Shasta (0.3.0), Racon (GitHub commit tag 6ca733a) and Medaka (ver. 0.11.1) software.

Polishing with Pilon

Two sets of Illumina paired-end short reads, 162 bp and 259 bp, were aligned to the fully phased contigs and hybrid scaffolds utilizing BWA MEM software¹⁷ (version 0.7.17). The resulting alignment files were sorted by coordinates and compressed using the Picard tools SortSam command (version 2.20.5). Subsequently, the BAM files for the 162- and 259-bp paired-end reads were merged using the Picard tools MergeSamFiles command. These merged BAM files were then split into individual scaffolds using the SAMtools¹⁸ (version 1.9) view command, after which each contig/scaffold underwent polishing using Pilon software¹⁹ (version 1.23). Finally, the polished contig/scaffolds were merged into a single multi-FASTA format file.

Meta-assembly

The Metassembler algorithm²⁰ performs pairwise, progressive alignments to merge multiple assemblies in the order specified by the user. One of the input assemblies will be used as the primary assembly and another as the secondary assembly. Mate-pair sequences dataset 1 and dataset 2 were used for meta-assembly within each individual and between individuals, respectively. The compression–expansion statistic (CE statistic) is calculated in both primary and secondary assemblies based on the mapping of mate-pair sequences in each assembly. The data from the secondary assembly are used to improve the primary assembly, such as correction of insertion/deletion errors, closing gaps and scaffolding sequences, which are based on comparing CE statistics between two assemblies.
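The text does not give a formula for the CE statistic. As a hedged illustration, the sketch below uses the commonly cited definition: a z-score comparing the mean insert size of the mate pairs spanning a locus with the library-wide mean. The function name and arguments are hypothetical and this is not Metassembler code.

```python
import math
from statistics import mean


def ce_statistic(spanning_inserts, lib_mean, lib_sd):
    """Z-score of the local mean insert size against the library mean.

    spanning_inserts: insert sizes of mate pairs spanning one locus.
    Strongly negative values suggest a compression (sequence missing
    from the assembly); strongly positive values suggest an expansion.
    Returns None when no mate pair spans the locus.
    """
    n = len(spanning_inserts)
    if n == 0:
        return None
    return (mean(spanning_inserts) - lib_mean) / (lib_sd / math.sqrt(n))


# Four mates that look 500 bp longer than a 3,000 +/- 500 bp library
z = ce_statistic([3500, 3500, 3500, 3500], lib_mean=3000, lib_sd=500)
```

Comparing such values at the same locus in the primary and secondary assemblies is one way to decide which of the two better agrees with the mate-pair data.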

For meta-assembly within an individual, four sets of PacBio/Hi-C phased scaffolds from each individual underwent a meta-assembly process by using the Metassembler software (ver. 1.5 with the modification described in ref. ⁸)²⁰. There were 24 possible combinations of meta-assembly of the four sets of phased scaffolds (Supplementary Table 5). Among the 24 meta-assemblies, the one with the longest scaffold length and the fewest scaffolds was selected for further hybrid scaffolding with Bionano-assembled genome maps and ONT assembly.
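Because Metassembler is order-dependent, the count of 24 is consistent with the 4! ordered arrangements of four inputs. The sketch below simply enumerates those orderings. The labels are hypothetical, and how each ordering is grouped into merge steps is not specified here (the actual combinations are listed in Supplementary Table 5).

```python
from itertools import permutations

# Hypothetical labels for the four phased scaffold sets of one individual
# (phasing enzyme x pseudohaplotype).
phased_sets = ["MboI_h0", "MboI_h1", "HindIII_h0", "HindIII_h1"]

# Every ordered arrangement of the four inputs
orderings = list(permutations(phased_sets))

for order in orderings[:3]:
    print(" -> ".join(order))
print(len(orderings), "orderings in total")
```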

For meta-assembly among the three individuals, the three sets of polished scaffolds were then meta-assembled using Metassembler software²⁰. There were 12 possible combinations to meta-assemble the three sets: (jg1a + (jg1b + jg1c)), (jg1a + (jg1c + jg1b)), ((jg1a + jg1b) + jg1c), ((jg1a + jg1c) + jg1b), (jg1b + (jg1a + jg1c)), (jg1b + (jg1c + jg1a)), ((jg1b + jg1a) + jg1c), ((jg1b + jg1c) + jg1a), (jg1c + (jg1a + jg1b)), (jg1c + (jg1b + jg1a)), ((jg1c + jg1a) + jg1b) and ((jg1c + jg1b) + jg1a), where x + y indicates meta-assembling x and y in this order. For each round of meta-assembly, assemblies were aligned using NUCmer²¹, filtered with delta-filter, and converted to COORDS format using show-coords. Mate-pair reads were classified with NxTrim²² and mapped using Bowtie2 (ref. ²³), followed by processing with mateAn. The alignment and mapping information were integrated using asseMerge, and the final output was converted to FASTA format with meta2fasta. All of the meta-assemblies in this step were used for anchoring to generate pseudomolecules.
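The 12 combinations listed above follow from taking every ordering of the three individuals (3! = 6) and nesting either the first two or the last two inputs. A short Python sketch, with a hypothetical function name, reproduces the list:

```python
from itertools import permutations


def meta_assembly_orders(samples):
    """Enumerate nested merge orders, where 'x + y' means x then y."""
    orders = []
    for x, y, z in permutations(samples):
        orders.append(f"({x} + ({y} + {z}))")  # merge the last two first
        orders.append(f"(({x} + {y}) + {z})")  # merge the first two first
    return orders


orders = meta_assembly_orders(["jg1a", "jg1b", "jg1c"])
for o in orders:
    print(o)
```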

Detection of in silico STS marker amplification and anchoring scaffolds to chromosomes

We performed in silico amplification of the sequence-tagged site (STS) markers of three genetic and six radiation hybrid (RH) maps (Genethon²⁴, Marshfield²⁵ and deCODE²⁶ genetic maps; GeneMap-G3 (ref. ²⁷), GeneMap99-GB4 (ref. ²⁷), TNG²⁸, NCBI_RH²⁹, Stanford-G3 (ref. ³⁰) and Whitehead-RH maps³¹) on the meta-scaffolds by using in-house electronic PCR software, gPCR (version 2.6a). The STS markers were sourced from the UniSTS database (ftp://ftp.ncbi.nih.gov/pub/ProbeDB/legacy_unists/). One combination of meta-scaffolds from jg1a, jg1b and jg1c per each chromosome was selected for downstream analysis. The meta-scaffolds were anchored to the chromosomes by the path command of ALLMAPS (ver. 0.8.12)³².
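gPCR is in-house software and its algorithm is not described here. The following is only a minimal sketch of the general idea of in silico STS amplification, assuming exact primer matches, a single strand orientation and a hypothetical size window; real electronic PCR tools also tolerate mismatches and search both strands.

```python
def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def in_silico_pcr(scaffold, fwd, rev, min_size, max_size):
    """Return half-open (start, end) amplicon coordinates on the scaffold.

    The forward primer must match the scaffold exactly and the reverse
    primer must match the opposite strand downstream of it.
    """
    hits = []
    rev_site = revcomp(rev)
    start = scaffold.find(fwd)
    while start != -1:
        site = scaffold.find(rev_site, start + len(fwd))
        while site != -1:
            size = site + len(rev_site) - start
            if size > max_size:
                break
            if size >= min_size:
                hits.append((start, start + size))
            site = scaffold.find(rev_site, site + 1)
        start = scaffold.find(fwd, start + 1)
    return hits


toy_scaffold = "TTT" + "ACGTAC" + "G" * 10 + "TTGACC" + "AAA"
amplicons = in_silico_pcr(toy_scaffold, "ACGTAC", "GGTCAA", 15, 30)
```

Each marker that amplifies on a scaffold gives a (scaffold position, map position) pair, which is the kind of input that anchoring against genetic and RH maps relies on.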

Mitochondrial genome and unplaced scaffolds

For the mitochondrial genome, we utilized the mitochondrial genome from JG1. Unanchored contigs/scaffolds generated from anchoring processes against the genetic and radiation hybrid maps using ALLMAPS were also collected. Subsequently, we removed contigs/scaffolds that mapped to the previously described set of chromosomes. In addition, contigs/scaffolds shorter than 1 kb were excluded. Next, we conducted an all-by-all alignment of the remaining contigs/scaffolds to obtain a set of unique sequences. Finally, scaffolds with N-gaps exceeding 80% of the sequence were excluded.
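The length and N-gap criteria for unplaced scaffolds can be expressed as a simple predicate. This sketch covers only those two thresholds; the removal of sequences that map to the chromosomes and the all-by-all deduplication are not shown, and the function name is hypothetical.

```python
def keep_unplaced(seq, min_len=1000, max_n_fraction=0.80):
    """Apply the length and N-gap filters to one contig/scaffold.

    Sequences shorter than 1 kb are dropped, as are sequences in
    which N bases exceed 80% of the length.
    """
    if len(seq) < min_len:
        return False
    n_fraction = seq.upper().count("N") / len(seq)
    return n_fraction <= max_n_fraction


candidates = {
    "short": "A" * 999,
    "mostly_gap": "N" * 900 + "A" * 100,
    "ok": "ACGT" * 250,
}
kept = [name for name, seq in candidates.items() if keep_unplaced(seq)]
```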

Major allele substitution and manual modification

The selected set of meta-assemblies was aligned with two sets of six meta-scaffolds (HindIII and MboI) using minimap2 (version 2.17)³³. Variants in each set were identified using the paftools call command. To standardize variant representation, we used the BCFtools norm command (version 1.9). Major allele substitutions were conducted by selecting the allele shared by more than three of six meta-scaffolds in each set. For multiallelic sites with equal allele frequencies, the selection was made randomly. Consecutive N-gap length for heterochromatic regions was manually modified.
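The majority decision rule can be sketched as follows. This is an illustration rather than the pipeline actually used. The function name is hypothetical, and two behaviours are assumptions not stated in the text: any tie for the most frequent allele is broken randomly, and a site with no majority and no tie is left unchanged (signalled by `None`).

```python
import random
from collections import Counter


def major_allele(alleles, rng=random):
    """Choose the reference allele at one site from six haploid scaffolds.

    alleles: the allele carried by each haploid meta-scaffold.
    """
    counts = Counter(alleles)
    top = max(counts.values())
    leaders = sorted(a for a, c in counts.items() if c == top)
    if top > len(alleles) / 2:   # e.g. more than three of six
        return leaders[0]
    if len(leaders) > 1:         # equally frequent alleles: pick at random
        return rng.choice(leaders)
    return None                  # assumption: leave the site unchanged


clear_majority = major_allele(["A", "A", "A", "A", "G", "G"])
tie_break = major_allele(["A", "A", "A", "G", "G", "G"], rng=random.Random(0))
```

Passing a seeded `random.Random` instance keeps the tie-breaking reproducible between runs.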

JG2 assembly assessment

Consensus quality

The JG2 was aligned with the GRCh38 reference genome using the NUCmer command from the MUMmer software suite²¹. The proportion of covered regions and the average identity between the genomes were calculated using the dnadiff tool, also part of the MUMmer suite³⁴. In addition, assemblies were aligned to the GRCh38 reference genome using minimap2 software, and variants were identified with the paftools call command. The normalized variants were then annotated using the SnpEff software³⁵, referencing the GRCh38.86 database.

Representativeness of JG2 for Japanese variants

To determine whether JG2 harbors the major allele among the Japanese, JG2 was aligned against the reference genome hs37d5 to detect SNVs. Genome-by-genome alignment and comparison were performed using minimap2 and paftools software³³. Allele frequency (AF) was investigated on the 3.5KJPNv2 AF panel^36,37. AF spectra of JG2 and JG1 were created with the horizontal axis representing the AF of the non-hs37d5-type allele and the vertical axis showing the number of such variant sites.
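An AF spectrum of this kind is a histogram of panel allele frequencies at the sites where the assembly differs from hs37d5. The sketch below assumes the AF of each non-hs37d5-type allele has already been looked up in the panel; the bin count and function name are hypothetical.

```python
def af_spectrum(afs, n_bins=20):
    """Count variant sites per AF bin; AF = 1.0 falls in the last bin."""
    counts = [0] * n_bins
    for af in afs:
        idx = min(int(af * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts


# Toy panel frequencies for five SNV sites
toy_afs = [0.0, 0.04, 0.5, 1.0, 1.0]
spectrum = af_spectrum(toy_afs)
fixed_sites = sum(1 for af in toy_afs if af == 1.0)
```

A reference that carries the major alleles of the population should show more sites in the high-AF bins, which is the comparison made between JG2 and JG1.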

Variant calling in 104 JPT samples from the 1000 Genomes Project

High-coverage (30×) CRAM files for 104 JPT (Japanese in Tokyo, Japan) individuals were downloaded from the International Genome Sample Resource. These CRAM files were subsequently converted to paired-end FASTQ format. The resulting FASTQ files were aligned to the respective reference genomes (JG2 and GRCh38) using BWA-MEM¹⁷. Following alignment, duplicate reads were marked, and subsequently, SNVs and indels were called using Strelka2³⁸. Callable regions were defined as 100-bp windows where sequencing depth (DP) was 10 ≤ DP < 100 in ≥90% of the 104 JPT individuals.
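The callable-region criterion can be sketched as below. How depth was summarized per 100-bp window is not stated, so this sketch assumes one DP value per window per individual; the function name and input layout are hypothetical.

```python
def callable_windows(window_depths, min_dp=10, max_dp=100, min_fraction=0.90):
    """Return indices of windows that pass the depth criterion.

    window_depths[i] holds one DP value per individual for window i.
    A window is callable when min_dp <= DP < max_dp holds in at least
    min_fraction of the individuals.
    """
    passing = []
    for i, depths in enumerate(window_depths):
        in_range = sum(1 for d in depths if min_dp <= d < max_dp)
        if in_range >= min_fraction * len(depths):
            passing.append(i)
    return passing


# With 104 individuals, 90% means at least 94 must be in range.
toy = [
    [30] * 94 + [5] * 10,   # 94 of 104 in range: callable
    [30] * 93 + [5] * 11,   # 93 of 104 in range: not callable
    [100] * 104,            # DP = 100 is outside the half-open range
]
result = callable_windows(toy)
```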

Results

JG2 construction

The individuals selected for constructing JG2 were three Japanese male volunteers named jg1a, jg1b and jg1c, the same individuals used to create the previous Japanese reference genome, JG1 (ref. ⁸). For each individual, we obtained over 120× PacBio CLR, two sets of over 49× Hi-C reads (using MboI and HindIII enzymes), one or two sets of over 120× Bionano optical genome maps (using DLE-1 or BspQI and BssSI enzymes), one set of ONT long reads, two sets of mate-pair reads and two sets of paired-end reads (Illumina HiSeq 162-bp and 259-bp reads). The depth of reads or optical genome maps is summarized in Supplementary Table 1.

Genome assembly for each individual

JG2 was created by first performing a phased assembly for each individual and then integrating genomes among the three Japanese individuals (Fig. 1). We performed a phased assembly for each individual using PacBio CLR reads by Falcon, Falcon-unzip and Quiver software of the pb-assembly software suite (Fig. 1A). The assembly statistics, including total length, N50 and the number of contigs, are presented in Supplementary Table 2. After polishing, the numbers of primary contigs of jg1a, jg1b and jg1c were 1,439, 1,386 and 1,271, respectively. The N50 for all three individuals was around 20 Mb.

**Fig. 1: Construction strategy of the JG2 reference genome.**

Subsequently, by using two sets of Hi-C data separately, we conducted phased assembly, which generated two sets of haploid genome assemblies per individual using Falcon-Phase software: MboI-based phased assemblies and HindIII-based phased assemblies (Fig. 1A). The assembly statistics are presented in Supplementary Table 3. These assemblies were then polished using 162-bp and 259-bp HiSeq short reads by Pilon software. Scaffolding was conducted on each set of polished haploid genome assemblies of the PacBio/Hi-C phased contigs utilizing SALSA2 software. Specifically, we utilized the Hi-C dataset that was not employed for phasing. For example, we used the MboI Hi-C dataset for scaffolding the HindIII-based phased assemblies. Conversely, we used the HindIII Hi-C dataset for scaffolding the MboI-based phased assemblies (Fig. 1A). Consequently, we obtained four sets of PacBio/Hi-C phased scaffolds for each individual. The assembly statistics are presented in Supplementary Table 3. The number of scaffolds of four sets of phased scaffolds in jg1a, jg1b and jg1c was 997 ± 55 (mean ± s.d.).

In addition to PacBio/Hi-C phased scaffolds, de novo assembly of Bionano optical genome maps was performed by BionanoSolve software, and de novo assembly of ONT nanopore reads was conducted using Shasta, Racon and Medaka software (Fig. 1B). Subsequently, the ONT assemblies were refined by polishing with 162-bp and 259-bp HiSeq short reads using Pilon software. The summary of assembly statistics is presented in Supplementary Table 4.

In summary, we acquired four sets of phased PacBio/Hi-C assemblies from each participant, along with one or two sets of Bionano-assembled genome maps and one set of ONT assemblies. These datasets were subsequently used in further meta-assembly.

Meta-assembly within each individual

The meta-assembly process involves integrating the locally best sequences from all input assemblies across the genome. These sequences are merged to create a final sequence that is either as good as or superior to the individual constituent assemblies. For each individual, four sets of phased PacBio/Hi-C assemblies underwent two-step meta-assemblies, resulting in 24 different meta-assemblies (Fig. 1A). Then, the meta-assembly with the fewest mis-assemblies—determined by comparison with genetic/RH maps and JG1—was selected for further hybrid scaffolding using Bionano-assembled genome maps and ONT assembly. Mate-pair short reads were used for meta-assembly by Metassembler software. The meta-assembly statistics for two-step meta-assemblies of phased PacBio/Hi-C assemblies are presented in Supplementary Table 5.

Following that, we conducted hybrid scaffolding of the meta-assembly using Bionano genome maps through BionanoSolve software. Subsequently, the resulting hybrid scaffolds were merged with the ONT assembly utilizing merged dataset 1 mate-pair short reads via Metassembler software. In this step of meta-assemblies within each individual, one meta-assembly per individual was generated. Consequently, we obtained three sets of meta-assemblies for the three individuals.

Meta-assemblies among individuals

Meta-assembly was conducted among the three individuals using mate-pair short reads from dataset 2 through Metassembler software, resulting in 12 sets of meta-assemblies (Fig. 1C). Subsequently, 12 sets of meta-assemblies were anchored to 3 genetic and 6 radiation hybrid maps using ALLMAPS software, resulting in 12 sets of pseudomolecules.

Selection of JG2

By comparing the pseudomolecules anchored to each linkage group with the corresponding chromosome sequence of JG2, one pseudomolecule was chosen for each chromosome sequence for JG2 (Supplementary Table 6). The selected pseudomolecules, mitochondrial reference genome and unplaced scaffolds comprise a full set of the reference genome. Major allele substitution was performed using variants from two sets of six meta-scaffolds (HindIII and MboI) to remove rare individual variants. Manual modifications were conducted to adjust the N-gap length for the acrocentric, centromeric, heterochromatic and telomeric regions, yielding the final set of genome sequences of JG2.

As all donors were male, the Y chromosome was also assembled. The final length of the Y chromosome was 50,152,051 bp, including 36,355,468 bp of N-gaps. Most of these gaps (35,226,671 bp) were intentionally inserted to represent centromeric and telomeric regions and to mask the pseudoautosomal regions for improved utility in NGS analysis. Excluding these intentional gaps, the resolved sequence was approximately 14 Mb, with remaining undetermined regions constituting less than 8% of this length, indicating a high-quality assembly.
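The Y-chromosome figures can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; which denominator underlies the quoted percentage is an assumption (here, the chromosome length excluding the intentionally inserted gaps).

```python
total_len = 50_152_051         # final Y chromosome length (bp)
all_n_gaps = 36_355_468        # all N-gap bases
intentional_gaps = 35_226_671  # intentionally inserted N-gap bases

resolved_bases = total_len - all_n_gaps            # non-N sequence
remaining_gaps = all_n_gaps - intentional_gaps     # unresolved N bases
len_without_intentional = total_len - intentional_gaps

# Assumed denominator: length with intentional gaps removed
undetermined_fraction = remaining_gaps / len_without_intentional

print(f"resolved bases:          {resolved_bases:,} bp")
print(f"remaining gaps:          {remaining_gaps:,} bp")
print(f"length w/o intentional:  {len_without_intentional:,} bp")
print(f"undetermined fraction:   {undetermined_fraction:.2%}")
```

Under this reading, about 1.13 Mb of undetermined sequence remains within roughly 14.9 Mb, which is below the 8% quoted in the text.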

Evaluation of JG2

The procedure described above resulted in a set of chromosome-level sequences for 22 autosomes, 2 sex chromosomes, 1 mitochondrial chromosome and 1148 unplaced scaffolds collectively designated as JG2. The total length of JG2 was approximately 3.1 Gb, which included 609 gap regions totaling 251 Mb. Of these gaps, 233.63 Mb were intentionally inserted to represent telomeric, centromeric and heterochromatic regions. The scaffold N50 was 152,668,378 bp. The number of misassemblies identified in JG2 was 581 when compared against GRCh38, and 503 against T2T-CHM13v2.0³⁹. Dot plot alignments of JG2 against these reference genomes are shown in Fig. 2. JG2 covered 91.11% of the GRCh38 with an average identity of 99.80% and covered 90.82% of the T2T-CHM13v2.0 with an average identity of 99.81%. Comparative analyses of JG2 against hs37d5, AK1, KOREF1, HX1 and the previously constructed JG1 are shown in Supplementary Fig. 1 and Supplementary Table 7.
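For readers unfamiliar with the metric, scaffold N50 is the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the total assembly length. A minimal sketch with toy lengths (not JG2 data) follows.

```python
def n50(lengths):
    """Return the N50 of a list of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


toy_lengths = [10, 8, 6, 4, 2]
toy_n50 = n50(toy_lengths)
```

Here the two longest scaffolds (10 + 8 = 18) already cover more than half of the total of 30, so the N50 is 8.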

**Fig. 2: Dot plots show co-linearity between JG2 and other reference genomes.**

Compared with the GRCh38 reference genome, JG2 contained 3,115,695 variants, of which 2130 (0.054%) were classified as high-impact variants. The number of protein-truncating variants, which include stop-gained, frameshift, splice-acceptor and splice-donor variants, was 55, 280, 141 and 118, respectively.

Representativeness of JG2 for SNVs composition of the Japanese population

The genome-by-genome alignment and comparison revealed 2,321,710 SNVs between hs37d5 and JG2 in the autosomes and X chromosome. Of these SNVs, 298,644 had an AF of 1.0 in the 3.5KJPNv2 AF panel, indicating that all Japanese individuals in the panel have JG2-type alleles at these 298,644 sites. This number is larger than that of JG1 (246,464), demonstrating that JG2 better represents major alleles in the Japanese population. In addition, we identified 366,364 SNV sites with an AF ≥0.99 and 624,847 SNV sites with an AF ≥0.90 in the 3.5KJPNv2 AF panel. AF spectra for SNVs comparing JG2 and JG1 are shown in Fig. 3.

**Fig. 3: The AF spectrum of JG2 and JG1.**

Variant calling with JG2 for 104 JPT individuals from 1KGP

To evaluate the utility of JG2 as a population-specific reference genome for the Japanese population, variant calling was performed using high-coverage WGS data from 104 JPT (Japanese in Tokyo) individuals from the 1000 Genomes Project aligned to both JG2 and GRCh38. The number of SNVs and indels called was consistently lower when using JG2. Specifically, the number of SNVs was 2,920,870 ± 17,174 for JG2, compared with 3,772,565 ± 22,010 for GRCh38 (Fig. 4A). Similarly, the number of indels was 771,343 ± 6058 for JG2, compared with 862,144 ± 6299 for GRCh38 (Fig. 4B). While the callable regions were slightly lower for some chromosomes in JG2 (Supplementary Table 8), this minor difference is substantially offset by a disproportionately larger reduction in detected variants (approximately 851,700 fewer SNVs and 90,800 fewer indels on average per individual). This highlights JG2’s improved accuracy for variant analysis in Japanese individuals by better reflecting population-specific sequences and thus reducing false-positive calls attributable to divergence from the reference.
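The rounded differences quoted above follow directly from the reported means. The short sketch below recomputes them, along with the relative reduction, which is not stated in the text and is shown only for orientation.

```python
# Mean calls per individual reported for the 104 JPT samples
snv = {"JG2": 2_920_870, "GRCh38": 3_772_565}
indel = {"JG2": 771_343, "GRCh38": 862_144}


def reduction(counts):
    """Absolute and relative drop in calls when using JG2."""
    diff = counts["GRCh38"] - counts["JG2"]
    return diff, diff / counts["GRCh38"]


snv_diff, snv_frac = reduction(snv)
indel_diff, indel_frac = reduction(indel)

print(f"SNVs:   {snv_diff:,} fewer ({snv_frac:.1%})")
print(f"indels: {indel_diff:,} fewer ({indel_frac:.1%})")
```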

**Fig. 4: Comparison of variant calls and callable regions between JG2 and GRCh38 reference genomes.**

Discussion

The current study demonstrated a strategy to assemble the population-specific reference genome, JG2, from three individuals whose genome data were used to construct the previous version of the Japanese-specific reference genome, JG1. By applying advanced strategies, JG2 better covers Japanese-specific variants and enhances assembly statistics. Using Hi-C data for phase-aware assembly, the process generated two haploid assemblies per individual. Unlike JG1, this method treats the haploid genomes as random samples from a population, increasing genetic variation representation. The increase in haplotypes enhanced population representativeness and, through majority decision, reduced the presence of rare personal variants, making JG2 more generalizable to the Japanese population.

Hi-C sequencing technology was instrumental in our phased assembly process. Hi-C data provide long-range contact information that helps to accurately place contigs and scaffolds into their correct positions along the chromosomes, enabling the differentiation of maternal and paternal haplotypes^40,41. This phasing is crucial for understanding the allelic variations and structural differences between the two sets of chromosomes, providing a more complete and accurate representation of the genome.

The meta-assembly strategy, which involved combining the best sequences from multiple assemblies, enhanced assembly quality for the Japanese genome⁸. For each individual, the meta-assembly process led to enhanced scaffold lengths, indicated by the increased N50 in JG2 compared with JG1 (152,668,378 bp versus 141,953,703 bp). In addition, a majority decision strategy was applied to remove rare variants. By comparing multiple assemblies and subsequently selecting the consensus sequence, we effectively mitigated the inclusion of sequencing errors and rare personal variants that might not represent the population. This approach reduced the number of misassemblies in JG2 compared with JG1 (1055 versus 1654) and ensured that the final assemblies were more accurate and reflective of the common genomic features within the Japanese population. The iterative meta-assembly among individuals further consolidated these improvements, creating comprehensive and representative pseudomolecules for each chromosome. Moreover, variant calling using 104 Japanese individuals with JG2 as the reference resulted in consistently fewer SNVs and indels compared with using GRCh38. This indicates that JG2 provides a better match to the Japanese allele spectrum and improves the accuracy of variant identification for this population.

Despite these advancements, JG2 is based on technology that lags slightly behind the state of the art represented by the T2T³⁹ and pangenome projects^42,43. T2T utilized high-accuracy long reads such as HiFi reads to construct a high-resolution assembly string graph that resolves repetitive and complex regions such as centromeres, telomeres and other difficult-to-sequence areas³⁹. In JG2, by contrast, these regions are represented by N, denoting uncharacterized bases. A pangenome encompasses the collective whole-genome sequences of multiple individuals to capture the genetic diversity of a species, utilizing high-quality, phased haplotypes and a graph-based data structure for improved reference and variant indexing, while remaining compatible with existing human reference genomes^42,43. Nevertheless, JG2 represents a steady improvement over JG1. This continuous improvement ensures that JG2 remains a valuable foundational resource for various analyses, such as identifying population-specific genetic variations or conducting comparative studies within the Japanese population.
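Because unresolved regions appear as runs of N, the extent of uncharacterized sequence in any assembly can be checked directly. The sketch below is a hedged illustration on a short in-memory string; the function names are hypothetical, and applying it to a real chromosome would require reading the sequence from a FASTA file, which is not shown here.

```python
import re


def n_runs(sequence):
    """Return (start, end) half-open coordinates of each run of N bases."""
    return [(m.start(), m.end()) for m in re.finditer(r"[Nn]+", sequence)]


def n_fraction(sequence):
    """Return the fraction of bases in the sequence that are N."""
    if not sequence:
        return 0.0
    n_count = sum(end - start for start, end in n_runs(sequence))
    return n_count / len(sequence)


# Toy sequence with two gaps of uncharacterized bases.
toy = "ACGTNNNNACGTNNAC"
print(n_runs(toy))      # [(4, 8), (12, 14)]
print(n_fraction(toy))  # 0.375
```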

In conclusion, this study improved the Japanese reference genome, creating JG2 via phased assembly from three individuals. Hi-C data enabled phase-aware assembly, generating two haploid assemblies per individual and better representing genetic variation. A meta-assembly strategy combined the best sequences, improving scaffold lengths and accuracy, while a majority decision approach minimized rare variants. These methods produced high-quality, comprehensive pseudomolecules, making JG2 more representative of the Japanese population.

Data availability

The JG2 sequences, along with additional resources, are available from the jMorp website³⁷ (https://jmorp.megabank.tohoku.ac.jp/downloads/tommo-jg2.0.0.beta-20200831). Raw sequencing data are available under controlled access and require agreement to the Condition of Use, which prohibits reidentification, restricts commercial use and mandates proper citation. Researchers may request access at https://www.dist.megabank.tohoku.ac.jp/.

References

1. Lander, E. S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
2. Venter, J. C. et al. The sequence of the human genome. Science 291, 1304–1351 (2001).
3. Ameur, A. et al. De novo assembly of two Swedish genomes reveals missing segments from the human GRCh38 reference and improves variant calling of population-scale sequencing data. Genes 9, 486 (2018).
4. Cho, Y. S. et al. An ethnically relevant consensus Korean reference genome is a step towards personal reference genomes. Nat. Commun. 7, 13637 (2016).
5. Seo, J.-S. et al. De novo assembly and phasing of a Korean human genome. Nature 538, 243–247 (2016).
6. Shi, L. et al. Long-read sequencing and de novo assembly of a Chinese genome. Nat. Commun. 7, 12065 (2016).
7. Ouzhuluobu et al. De novo assembly of a Tibetan genome and identification of novel structural variants associated with high-altitude adaptation. Natl Sci. Rev. 7, 391–402 (2019).
8. Takayama, J. et al. Construction and integration of three de novo Japanese human genome assemblies toward a population-specific reference. Nat. Commun. 12, 226 (2021).
9. Uneoka, S. et al. A case series of patients with MYBPC1 gene variants featuring undulating tongue movements as myogenic tremor. Pediatr. Neurol. 146, 16–20 (2023).
10. Shibuya, M. et al. A patient with early-onset SMAX3 and a novel variant of ATP7A. Brain Dev. 44, 63–67 (2022).
11. Katata, Y. et al. The longest reported sibling survivors of a severe form of congenital myasthenic syndrome with the pathogenic variant. Am. J. Med. Genet. Part A 188, 1293–1298 (2022).
12. Ito, S. et al. A novel 8.57-kb deletion of the upstream region of PRKAR1A in a family with Carney complex. Mol. Genet. Genom. Med. 10, e1884 (2022).
13. Chin, C.-S. et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat. Methods 13, 1050–1054 (2016).
14. Kronenberg, Z. N. et al. Extended haplotype-phasing of long-read de novo genome assemblies using Hi-C. Nat. Commun. 12, 1935 (2021).
15. Ghurye, J. et al. Integrating Hi-C links with assembly graphs for chromosome-scale assembly. PLoS Comput. Biol. 15, e1007273 (2019).
16. Lieberman-Aiden, E. et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326, 289–293 (2009).
17. Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at https://arxiv.org/abs/1303.3997 (2013).
18. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
19. Walker, B. J. et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS ONE 9, e112963 (2014).
20. Wences, A. H. & Schatz, M. C. Metassembler: merging and optimizing de novo genome assemblies. Genome Biol. 16, 207 (2015).
21. Marçais, G. et al. MUMmer4: a fast and versatile genome alignment system. PLoS Comput. Biol. 14, e1005944 (2018).
22. O’Connell, J. et al. NxTrim: optimized trimming of Illumina mate pair reads. Bioinformatics 31, 2035–2037 (2015).
23. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
24. Dib, C. et al. A comprehensive genetic map of the human genome based on 5,264 microsatellites. Nature 380, 152–154 (1996).
25. Broman, K. W., Murray, J. C., Sheffield, V. C., White, R. L. & Weber, J. L. Comprehensive human genetic maps: individual and sex-specific variation in recombination. Am. J. Hum. Genet. 63, 861–869 (1998).
26. Kong, A. et al. A high-resolution recombination map of the human genome. Nat. Genet. 31, 241–247 (2002).
27. Stewart, E. A. et al. An STS-based radiation hybrid map of the human genome. Genome Res. 7, 422–433 (1997).
28. Olivier, M. et al. A high-resolution radiation hybrid map of the human genome draft sequence. Science 291, 1298–1302 (2001).
29. Agarwala, R., Applegate, D. L., Maglott, D., Schuler, G. D. & Schäffer, A. A. A fast and scalable radiation hybrid map construction and integration strategy. Genome Res. 10, 350–364 (2000).
30. Deloukas, P. et al. A physical map of 30,000 human genes. Science 282, 744–746 (1998).
31. Hudson, T. J. et al. An STS-based map of the human genome. Science 270, 1945–1954 (1995).
32. Tang, H. et al. ALLMAPS: robust scaffold ordering based on multiple maps. Genome Biol. 16, 3 (2015).
33. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
34. Phillippy, A. M., Schatz, M. C. & Pop, M. Genome assembly forensics: finding the elusive mis-assembly. Genome Biol. 9, R55 (2008).
35. Cingolani, P. et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly 6, 80–92 (2012).
36. Tadaka, S. et al. 3.5KJPNv2: an allele frequency panel of 3552 Japanese individuals including the X chromosome. Hum. Genome Var. 6, 28 (2019).
37. Tadaka, S. et al. jMorp: Japanese Multi-Omics Reference Panel update report 2023. Nucleic Acids Res. 52, D622–D632 (2024).
38. Kim, S. et al. Strelka2: fast and accurate calling of germline and somatic variants. Nat. Methods 15, 591–594 (2018).
39. Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022).
40. Rao, S. S. et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159, 1665–1680 (2014).
41. Dudchenko, O. et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science 356, 92–95 (2017).
42. Wang, T. et al. The Human Pangenome Project: a global resource to map genomic diversity. Nature 604, 437–446 (2022).
43. Liao, W.-W. et al. A draft human pangenome reference. Nature 617, 312–324 (2023).

Acknowledgements

This work was supported in part by the Tohoku Medical Megabank (TMM) Project from the Ministry of Education, Culture, Sports, Science and Technology (MEXT) and by the Japan Agency for Medical Research and Development (AMED; grant number JP21tm0124005) for Tohoku University. This work was also supported in part by JST Moonshot R&D Program grant number JPMJMS2023 to G.T. and J.T. All computational resources were provided by the ToMMo supercomputer system (http://sc.megabank.tohoku.ac.jp/en), which is supported by Facilitation of R&D Platform for AMED Genome Medicine Support conducted by AMED (grant number JP21tm0424601). We appreciate all the volunteers who participated in the TMM project.

Author information

Authors and Affiliations

Department of AI and Innovative Medicine, Tohoku University School of Medicine, Sendai, Japan
Sirawit Sriwichaiin, Takamitsu Funayama, Gen Tamiya & Jun Takayama
Tohoku Medical Megabank Organization, Tohoku University, Sendai, Japan
Satoshi Makino, Takamitsu Funayama, Akihito Otsuki, Junko Kawashima, Yasunobu Okamura, Shu Tadaka, Fumiki Katsuoka, Kazuki Kumada, Kengo Kinoshita, Masayuki Yamamoto, Gen Tamiya & Jun Takayama
Advanced Research Center for Innovations in Next-Generation Medicine, Tohoku University, Sendai, Japan
Yasunobu Okamura, Fumiki Katsuoka & Kengo Kinoshita
Genome Science Division, Research Center for Advanced Science and Technology, The University of Tokyo, Tokyo, Japan
Shuichi Tsutsumi
Department of Applied Information Sciences, Graduate School of Information Sciences, Tohoku University, Sendai, Japan
Kengo Kinoshita
Department of In Silico Analyses, Institute of Development, Aging and Cancer, Tohoku University, Sendai, Japan
Kengo Kinoshita
Department of Biochemistry and Molecular Biology, Tohoku Medical Megabank Organization, Tohoku University, Sendai, Japan
Masayuki Yamamoto
Statistical Genetics Team, RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
Gen Tamiya & Jun Takayama

Corresponding author

Correspondence to Jun Takayama.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Fig. 1

Supplementary Tables

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Sriwichaiin, S., Makino, S., Funayama, T. et al. JG2: an updated version of the Japanese population-specific reference genome. Hum Genome Var 12, 21 (2025). https://doi.org/10.1038/s41439-025-00326-y

Received: 31 May 2025
Revised: 28 July 2025
Accepted: 28 July 2025
Published: 01 October 2025
Version of record: 01 October 2025
DOI: https://doi.org/10.1038/s41439-025-00326-y