Background & Summary

Molecular dynamics (MD) simulations are powerful tools for elucidating molecular-level behavior not only in biomolecular systems but also in polymer material sciences1,2,3,4. In MD simulations, coordinate data are recorded for detailed analyses. For such analyses, it is necessary to develop mathematical methods that can accurately evaluate how the linear chain penetrates the ring polymer; this has long been an important problem in the mathematics of topology5,6,7,8,9,10,11,12,13,14. The relevance of this task is not limited to ring-linear polymer blends13,14; research on knots in proteins15,16,17,18, threading of ring polymers19,20,21,22, and cross-linked networks23,24 is greatly concerned with the linkage between loops and chains owing to its impact on the material properties. Therefore, public availability of MD coordinate data is expected to promote the development of analysis methods by applied mathematicians.

Recently, mixed systems of ring and linear polymers have attracted increasing attention in the field of polymer materials. This is because recent experimental results have demonstrated the toughness of cross-linked ring-linear polymer blends25,26. Here, ring polymers work as movable cross-linking points to prevent stress concentration25,26. To understand these systems, it is important to first conduct detailed investigations of the equilibrium states of the ring-linear polymer blends. The equilibrium state can be obtained by long-term MD simulations13,14 in systems with a large number of ring and linear polymers; however, this is not an easy task. Thus, it is desirable to improve global efficiency through data sharing and reuse instead of duplicating calculations across multiple groups.

A mechanism for efficiently sharing data with reduced data sizes is important because MD trajectory datasets are typically very large. Moreover, compression of floating-point data is a common problem for scientific simulations in high-performance computing (HPC)27,28,29,30,31,32,33,34,35,36. Some studies on data compression27,28,29 found that the tailing bits in the fraction part of floating-point values in scientific data are much more random than the leading bits and therefore cannot be compressed effectively. Methods to neglect tail bits include error-controlled lossy data-compression methods such as ZFP30, ISABELA31, SSEM32, and SZ33,34,35,36. Recently, compressor performance has been compared using benchmark data in various scientific domains; for example, for ZFP and SZ by Lu et al.37, Tao et al.38, and Cappello et al.39. As a result, SZ is regarded as a standard efficient compressor in HPC research for exascale computing. Note that Di and Cappello40 reported that time-trajectory analysis-based compressors41,42,43,44,45,46,47,48 become impractical in extremely large-scale particle simulations owing to limited memory capacity. Thus, we focus on the data compression of snapshots.

For lossy compression of MD trajectory data in polymer systems, the required numerical accuracy (error level) must be met and physical meanings, such as the topology, must be preserved. Moreover, in the bit string of the coordinate data in polymer systems, the bits in the sequence along a chain have similar characteristics to time-series data in scientific simulations. Several authors29,49,50 have proposed the Jointed Hierarchical Precision Compression Number - Data Format (JHPCN-DF) method, which is a hierarchical segmented recording based on the required numerical precision (error level).

In this study, we analyze the relationship between the numerical accuracy and topology preservation of polymer MD trajectory data under JHPCN-DF compression with the aim of developing a publicly available database. The examined datasets consist of multiple melt systems with a mixture of ring polymers and linear chains. These datasets were prepared as well-equilibrated initial configurations for subsequent MD simulations in order to measure the rheological51 and mechanical properties after setting crosslinks. Note that these shared datasets enabled the first successful discovery51 of a viscosity overshoot under biaxial extensional flows. In addition, these datasets are appropriate for the development of more accurate and rigorous mathematical judgment methods52, as well as efficient approximation techniques based on primitive path (PP) analysis53. As these datasets provide equilibrium states, they can also be useful for developing further coarse-grained MD models that reproduce these states54 and for planning neutron scattering experiments to observe ring shapes in ring-linear blends. Moreover, publicly available data of polymer systems can be used as a benchmark dataset in the data-compression research community.

Method

Molecular dynamics simulations of ring-linear polymer blends

We generated a dataset that included all combinations of the parameter conditions shown in Table 1 by performing MD simulations13,14. In all cases, long MD simulations of 10^9 MD steps were performed to obtain a well-equilibrated configuration of ring-linear polymer blends. Figure 1 presents schematics of the ring complexes. The examined system size was approximately 600,000 beads. The box sizes of the periodic boundary condition (PBC) were approximately (80)^3 in the scale units. Note that the numbers of ring polymers and linear chains are included in the filename of each binary file.

Table 1 Parameter conditions.
Fig. 1

Schematics of single ring, bonded-rings, poly-catenanes, and ring-linear mixture. The snapshot of the ring-linear mixture with primitive path (PP)53 presentations for Nring = Nlinear = 160 with ring fraction 0.1 was rendered by OVITO60. In (f), ring polymers and linear chains are shown in red and green, respectively. The ends of linear chains are shown in blue.

To obtain equilibrated configurations of ring-linear polymer blends, we performed coarse-grained MD simulations of the Kremer-Grest (KG) model55. Ring polymers with bead number Nring and linear chains with length Nlinear were placed in a box with PBCs, where the numbers of ring and linear polymers were Mring and Mlinear, respectively. The length of each simulation run was 10^9 MD steps with a time step (Δt) of 0.005τ, where τ is a time unit.

In the KG model, the Lennard–Jones (LJ) potential with a cutoff length of rc was applied to every pair of particles.

$${U}_{{\rm{LJ}}}(r)=4\varepsilon \left[{\left(\frac{\sigma }{r}\right)}^{12}-{\left(\frac{\sigma }{r}\right)}^{6}-{\left(\frac{\sigma }{{r}_{{\rm{c}}}}\right)}^{12}+{\left(\frac{\sigma }{{r}_{{\rm{c}}}}\right)}^{6}\right]$$

when r < rc, whereas ULJ(r) = 0 when r ≥ rc, where r is the distance between the beads, ε is the interaction strength, σ is the scale unit, and rc is the cutoff length of the interaction. For simplicity, we set ε = σ = 1 hereafter. To reproduce the excluded volume of chains with minimal computing costs, we set rc to 2^(1/6). For bonded beads, the finite extensible nonlinear elastic (FENE) potential was also applied, where

$${U}_{{\rm{FENE}}}(r)=-\frac{k}{2}{R}_{0}^{2}\,\ln \left[1-{\left(\frac{r}{{R}_{0}}\right)}^{2}\right]$$

for r < R0 and UFENE(r) = 0 for r ≥ R0. Here, k is the spring constant and R0 is the maximum bond length. The LJ and FENE potentials with k = 30 and R0 = 1.5 are widely used to prevent chains from crossing each other. The ring and linear polymers were placed in a box under PBCs with a bead number density of 0.85. Additionally, all ring polymers were unconcatenated. The bead dynamics in our model were described by a Langevin equation with a friction constant of 0.5τ^−1 and a temperature T. For simplicity, we set the mass of a bead (m) to unity so that T and the LJ time (τ = σ(m/ε)^(1/2)) became unity. The velocity Verlet algorithm was used for numerical integration of the Langevin equation. In this study, we used the LAMMPS56 and HOOMD-blue57 MD simulation software.
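As an illustration of the two pair potentials defined above, the following minimal Python sketch evaluates the truncated and shifted LJ potential and the FENE bond potential with the stated parameters (ε = σ = 1, rc = 2^(1/6), k = 30, R0 = 1.5). The function names `u_lj` and `u_fene` are hypothetical and are not taken from LAMMPS or HOOMD-blue; the sketch only evaluates the r < R0 branch of the FENE expression.

```python
import math

EPSILON = 1.0               # LJ interaction strength (set to 1 in the text)
SIGMA = 1.0                 # scale unit (set to 1 in the text)
R_CUT = 2.0 ** (1.0 / 6.0)  # cutoff length r_c
K_FENE = 30.0               # FENE spring constant k
R0 = 1.5                    # maximum bond length


def u_lj(r):
    """Truncated and shifted LJ potential; zero at and beyond the cutoff."""
    if r >= R_CUT:
        return 0.0
    sr6 = (SIGMA / r) ** 6
    src6 = (SIGMA / R_CUT) ** 6
    return 4.0 * EPSILON * (sr6 * sr6 - sr6 - src6 * src6 + src6)


def u_fene(r):
    """FENE bond potential, evaluated only on the r < R0 branch."""
    if r >= R0:
        raise ValueError("this sketch only covers bond lengths r < R0")
    return -0.5 * K_FENE * R0 ** 2 * math.log(1.0 - (r / R0) ** 2)


def u_bond(r):
    """Total energy of a bonded pair: LJ repulsion plus FENE attraction."""
    return u_lj(r) + u_fene(r)
```

With these parameters the shift makes the LJ term vanish continuously at rc, so only the repulsive core remains, and the bonded-pair energy `u_bond` has a single minimum below R0.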

Topology judgement method of chain-penetration into a ring

We evaluated the Gauss Linking Numbers (GLNs) for all ring–linear pairs. However, GLNs cannot be applied to a ring and a linear chain unless the latter is a closed loop. In practice, the ends of linear chains are virtually connected to each other, but we prepared an extra linear chain and connected it to the original linear chain to form a cyclic chain. Details of this method were given in our previous work13,14. To compute GLNs among cyclic chains and ring polymers, we used the Topoly Python package58. For a catenated cyclic chain and ring pair, the GLN was equal to 1. Otherwise, GLN = 0. When GLN = 1, we concluded that the linear chain had penetrated the target ring chain.
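To make the GLN criterion concrete, the sketch below estimates the Gauss linking integral for two closed polylines with a simple midpoint rule. This is not the Topoly implementation and not the chain-closure procedure of the previous work13,14; the function names and the two test circles are hypothetical and serve only to show that a catenated pair gives |GLN| ≈ 1 and a separated pair gives GLN ≈ 0.

```python
import math


def _segments(points):
    """Return (midpoint, tangent vector) for every edge of a closed polyline."""
    n = len(points)
    out = []
    for i in range(n):
        p, q = points[i], points[(i + 1) % n]
        mid = [(p[k] + q[k]) / 2.0 for k in range(3)]
        tangent = [q[k] - p[k] for k in range(3)]
        out.append((mid, tangent))
    return out


def gauss_linking_number(curve_a, curve_b):
    """Midpoint-rule estimate of the Gauss linking integral of two closed curves."""
    total = 0.0
    seg_b = _segments(curve_b)
    for mid_a, da in _segments(curve_a):
        for mid_b, db in seg_b:
            d = [mid_a[k] - mid_b[k] for k in range(3)]
            dist3 = (d[0] ** 2 + d[1] ** 2 + d[2] ** 2) ** 1.5
            cross = (
                da[1] * db[2] - da[2] * db[1],
                da[2] * db[0] - da[0] * db[2],
                da[0] * db[1] - da[1] * db[0],
            )
            total += (cross[0] * d[0] + cross[1] * d[1] + cross[2] * d[2]) / dist3
    return total / (4.0 * math.pi)


def circle_xy(cx, n=120):
    """Unit circle in the xy-plane centered at (cx, 0, 0)."""
    return [(cx + math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n), 0.0)
            for i in range(n)]


def circle_xz(cx, n=120):
    """Unit circle in the xz-plane centered at (cx, 0, 0)."""
    return [(cx + math.cos(2 * math.pi * i / n), 0.0, math.sin(2 * math.pi * i / n))
            for i in range(n)]


ring = circle_xy(0.0)
linked = circle_xz(1.0)    # passes through the ring (Hopf link)
unlinked = circle_xz(5.0)  # far away from the ring

gln_linked = gauss_linking_number(ring, linked)
gln_unlinked = gauss_linking_number(ring, unlinked)
```

The sign of the result depends on the orientation of the two curves, so the penetration judgment in this sketch would use the rounded absolute value.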

Efficient compression of floating-point data

To achieve efficient sharing of lossy and lossless compressed data, the JHPCN-DF method29,49,50 was used for hierarchical segmented recording based on the required numerical precision (error level). In essence, the JHPCN-DF framework involves lossless compression with segmented recording; for users who employ parts of the recording, it works as lossy compression. One of the merits of this framework is a substantial reduction of data transfer from big supercomputers to front-end computers for data confirmation through visualization. It should be noted that the part of compression related to the first fraction bits can be regarded as the same as masked data compression28, which was proposed independently by Gomez and Cappello.

The required number of bits in the IEEE 754 format differs for different purposes such as visualization and analysis of scientific data, as shown in Fig. 2. Thus, the required number of bits needs to be properly evaluated for each purpose and simulation target. In scientific simulations using the laws of physics, the first fraction bits are correlated in space and time. However, the tailing fraction bits do not always contribute to visualization and analyses and may instead exhibit random noise-like behavior, which negatively affects data compression27,28,29,49,50. A higher compression ratio can be obtained by using only the first fraction bits if the tailing fraction bits can be neglected. Regarding compression efficiency, both data size and ease of use should be considered. For the latter, a simple solution should not change the Application Programming Interface (API). Thus, the conventional binary format with Huffman coding (e.g., gzip) and HDF5 can be used as the data API. A combination of zero padding and data compression (such as Huffman coding) can be effective because the size of information in the zero-padded bits becomes negligibly small after Huffman coding.
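The effect of zero padding on a general-purpose compressor can be sketched as follows. The example uses Python's standard `zlib` module, which implements the DEFLATE algorithm also used by gzip, and synthetic uniformly random coordinates rather than real MD data; the helper name `mask_tail_bits` and the choice of 24 retained bits are assumptions for illustration only.

```python
import random
import struct
import zlib


def mask_tail_bits(value, keep_bits):
    """Zero-pad an IEEE 754 double, keeping only its leading keep_bits bits."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    mask = ((1 << keep_bits) - 1) << (64 - keep_bits)
    return struct.unpack("<d", struct.pack("<Q", bits & mask))[0]


# Synthetic stand-in for bead coordinates in a box of side 80 (not real data).
rng = random.Random(0)
coords = [rng.uniform(0.0, 80.0) for _ in range(10000)]
padded_coords = [mask_tail_bits(x, 24) for x in coords]

raw = struct.pack("<%dd" % len(coords), *coords)
padded = struct.pack("<%dd" % len(padded_coords), *padded_coords)

raw_size = len(zlib.compress(raw, 9))        # tail bits are noise-like
padded_size = len(zlib.compress(padded, 9))  # zero-padded tails cost almost nothing
max_error = max(abs(a - b) for a, b in zip(coords, padded_coords))
```

Because the binary layout is unchanged, the padded file can be read through the same interface as the original one; only its compressed size and precision differ.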

Fig. 2

Required number of bits for visualization and analysis.

In our implementation29,49,50, the required bit length of each floating-point value was checked against a user-specified error level, such as 0.000001. For IEEE 754 double-precision floating-point data, each separated group of bits is zero-padded and stored as a 64-bit integer, so that higher-precision data and the original (lossless) data can be reconstructed from the separate recordings. The recordings in the separated binary files using the JHPCN-DF framework are presented in Fig. 3. In this example, 64 bits of double-precision data were split into three parts: [24 bits + 0-padding (40 bits)], [0-padding (24 bits) + 17 bits + 0-padding (23 bits)], and [0-padding (41 bits) + 23 bits]. Before Huffman coding, the original 64 bits occupied 192 bits in memory. After Huffman coding, the total size became less than 64 bits. For decoding, a bitwise OR of the separated data reconstructs the original (lossless) data and/or higher-precision data. For the example shown in Fig. 3, lossless data can be obtained as the bitwise OR of the three 64-bit recordings: OR([24 bits + 0-padding (40 bits)], [0-padding (24 bits) + 17 bits + 0-padding (23 bits)], [0-padding (41 bits) + 23 bits]).
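The segmented recording and its OR-based decoding can be sketched in a few lines of Python. This is a minimal illustration of the 24 + 17 + 23 bit split described for Fig. 3, not the JHPCN-DF library itself; the function names and the sample coordinate are chosen here for illustration.

```python
import struct


def double_to_bits(value):
    """Reinterpret an IEEE 754 double as a 64-bit unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]


def bits_to_double(bits):
    """Reinterpret a 64-bit unsigned integer as an IEEE 754 double."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def split_segments(value, boundaries=(24, 41)):
    """Split a double into zero-padded 64-bit recordings at the given bit positions."""
    bits = double_to_bits(value)
    edges = (0,) + tuple(boundaries) + (64,)
    segments = []
    for start, stop in zip(edges[:-1], edges[1:]):
        mask = ((1 << (stop - start)) - 1) << (64 - stop)
        segments.append(bits & mask)
    return segments


def merge_segments(segments):
    """Bitwise OR of the available recordings, decoded back to a double."""
    merged = 0
    for segment in segments:
        merged |= segment
    return bits_to_double(merged)


x = 1981.244394305023
parts = split_segments(x)
coarse = merge_segments(parts[:1])    # leading 24 bits only
finer = merge_segments(parts[:2])     # leading 41 bits
lossless = merge_segments(parts)      # all 64 bits
```

A user who downloads only the first recording obtains a lossy value, and each additional recording refines it until the original double is recovered exactly.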

Fig. 3

Example application of separated binary files created within the JHPCN-DF. In this example, the required number of bits was 24 bits and 41 bits for visualization and analysis, respectively. 64 bits of double-precision data were split into three 64-bit recordings: [24 bits + 0-padding (40 bits)], [0-padding (24 bits) + 17 bits + 0-padding (23 bits)], and [0-padding (41 bits) + 23 bits]. Huffman coding reduced the total size of the original 64 bits to less than 64 bits.

Data Records

The dataset59 consists of 150 systems of ring-linear polymer blends, as shown in Table 1. The datasets are available via the Figshare repository.

Dataset 1

Each filename contains information on the type of ring complex, Nring, Mring, Nlinear, Mlinear, and fring. For example, “TwoB_NR120x240_NL20x28800_fr005-D-jhpcndf000001” indicates that the complex consisted of two bonded ring polymers (as shown in Fig. 1(b)), Nring = 120, Mring = 240, Nlinear = 20, Mlinear = 28,800, and fring = 0.05. The types of ring complex are indicated by “One,” “TwoB,” “ThreeB,” “TwoC,” and “ThreeC,” which correspond to Fig. 1(a–e), respectively. Note that “D-jhpcndf000001” indicates double-precision binary with JHPCN-DF compression and an error level of 0.00001.

Each file contains the following data:

  • Size of PBC box (3 × 8 bytes)

  • Positions of beads (3 × Ntotal × 8 bytes)

    Here, Ntotal = Nring Mring + Nlinear Mlinear. Within the position data, the first 3 × Nring × Mring × 8 bytes hold the positions of the ring polymers, and the remaining data hold the positions of the linear chains. In this database, we assumed that the bead order represents the bond connection. Each consecutive group of Nring beads forms a single ring polymer, whereas each consecutive group of Nlinear beads forms a linear chain.

    In addition, the tailing fraction bits of bead positions were also provided with int64 binary; these are indicated with “D-jhpcndf000001XOR” to denote JHPCN-DF compression and the tailing (XOR) parts. Here, the tailing fraction bits were obtained from the XOR-operation between the original data and the double-precision binary with JHPCN-DF compression.

  • Tailing fraction bits of positions of beads (3 × Ntotal × 8 bytes)
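The naming convention and the record sizes listed above can be checked with a short parsing sketch. The regular expression and the rule for decoding the digit strings (a decimal point after the first digit, so that "005" reads as 0.05 and "000001" as 0.00001) are inferred from the single example filename and should be treated as assumptions; `describe` is a hypothetical helper, not part of any released tool.

```python
import re

# Pattern inferred from the example filename in the text (an assumption).
FILENAME_PATTERN = re.compile(
    r"^(?P<kind>One|TwoB|ThreeB|TwoC|ThreeC)"
    r"_NR(?P<n_ring>\d+)x(?P<m_ring>\d+)"
    r"_NL(?P<n_linear>\d+)x(?P<m_linear>\d+)"
    r"_fr(?P<f_ring>\d+)"
    r"-D-jhpcndf(?P<error>\d+)(?P<xor>XOR)?$"
)


def decode_fraction(digits):
    """Read a digit string such as '005' as 0.05 (decimal point after first digit)."""
    return int(digits) / 10 ** (len(digits) - 1)


def describe(filename):
    """Extract system parameters and expected record sizes from a dataset filename."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError("unrecognized filename: %s" % filename)
    n_ring, m_ring = int(match["n_ring"]), int(match["m_ring"])
    n_linear, m_linear = int(match["n_linear"]), int(match["m_linear"])
    n_total = n_ring * m_ring + n_linear * m_linear
    return {
        "kind": match["kind"],
        "n_total": n_total,
        "f_ring": decode_fraction(match["f_ring"]),
        "error_level": decode_fraction(match["error"]),
        "is_xor_tail": match["xor"] is not None,
        "box_bytes": 3 * 8,                     # size of the PBC box
        "position_bytes": 3 * n_total * 8,      # all bead positions
        "ring_bytes": 3 * n_ring * m_ring * 8,  # leading part: ring polymers
    }


info = describe("TwoB_NR120x240_NL20x28800_fr005-D-jhpcndf000001")
```

For the example filename, this gives Ntotal = 604,800 beads, which matches the stated system size of approximately 600,000 beads.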

Technical Validation

Evaluation of segmented recording data

For the double-precision data generated in the MD simulations, we applied JHPCN-DF compression with user-specified error levels of 0.00001, 0.000001, and 0.0000001. For tests of single-precision binary data, single-precision data were obtained by casting from double-precision data. For single-precision binary analysis, we examined cases with user-specified error levels of 0.1, 0.01, 0.001, and 0.0001. Here, 0.0001 was smaller than the limit from the value range, as mentioned below.

Tables 2 and 3 present the size [bytes] and compression ratio of compressed files for single and double-precision binary recording. Here, we employed three methods to achieve the specified error level of the compressed files: (1) “tar” and “gzip -9” for the segmented recording binary file based on JHPCN-DF, (2) “tar” for the “sz”-compressed file of the lossless binary file, and (3) “tar” for the “sz”-compressed file of the segmented recording binary file with JHPCN-DF. Here, we used version 2.1.8.3 of SZ with the Zstd best compression mode36. In the process of generating the compressed files, we monitored the maximum and minimum values of positions: Max = 1981.244394305023 and Min = −1806.817917672729. It should be noted that these values may be inaccurate with single precision. In the case of single precision, given this range and the 23-bit fraction part, (Max − Min)/2^23 was approximately 0.00045; therefore, an error level of 0.0001 cannot be maintained even for a single-precision binary without JHPCN-DF. According to the obtained compression ratios, the results for all compression methods were similar. For all cases, the combination of JHPCN-DF and the SZ-compressor showed the best performance. It should be noted that the increased size of SZ-compressed files for single-precision data with a specified error level of 0.0001 may be a result of insufficiently detailed parameter tuning. Further analysis of this hypothesis is beyond the scope of this paper.
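For reference, the single-precision estimate quoted above follows directly from the monitored range and the 23-bit fraction part:

$$\frac{{\rm{Max}}-{\rm{Min}}}{{2}^{23}}=\frac{1981.244394305023+1806.817917672729}{8388608}\approx 4.5\times {10}^{-4} > {10}^{-4}$$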

Table 2 Single-precision binary recording: compressed file size [bytes] and confusion matrix of topology judgments.
Table 3 Double-precision binary recording: compressed file size [bytes] and confusion matrix of topology judgments.

Topology analyses using segmented recording data

As a test for the segmented recording data, we evaluated the GLN for topology judgment regarding penetration of a linear chain into a ring polymer using the method proposed by the authors13,14. This test is necessary because the topology is not conserved if the numerical accuracy is poor. The ratio of correct answers of the topology judgment was used as the evaluation index, which was obtained for several user-specified accuracies. Tables 2 and 3 present the confusion matrix and error ratio of the topology judgment for all pairs of ring polymers and linear chains in all systems. Here, the confusion matrix, which is widely used for two-class classification problems in machine learning, is given as [[True Positive (TP), False Negative (FN)], [False Positive (FP), True Negative (TN)]], where “Positive” means that the linear chain penetrated into the ring polymer and “True” means that the topology was preserved between lossless compression and the specified error level. The error ratio was defined as (FP + FN)/(TP + FP + FN + TN).
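The confusion matrix and error ratio defined above can be computed as in the following sketch. It assumes that the judgment from the lossless data is the reference and the judgment from the lossy data is the prediction, which is one reading of the definition in the text; the function names and the six example judgments are hypothetical and are not taken from Tables 2 and 3.

```python
def confusion_matrix(reference, lossy):
    """Return [[TP, FN], [FP, TN]] for paired penetration judgments (1 = penetrated)."""
    tp = fn = fp = tn = 0
    for ref, test in zip(reference, lossy):
        if ref and test:
            tp += 1   # penetration found in both lossless and lossy data
        elif ref and not test:
            fn += 1   # penetration lost after lossy compression
        elif not ref and test:
            fp += 1   # spurious penetration after lossy compression
        else:
            tn += 1   # no penetration in either
    return [[tp, fn], [fp, tn]]


def error_ratio(matrix):
    """Error ratio (FP + FN) / (TP + FP + FN + TN)."""
    (tp, fn), (fp, tn) = matrix
    return (fp + fn) / (tp + fp + fn + tn)


# Made-up judgments for six ring-linear pairs (illustration only).
reference_judgments = [1, 1, 0, 0, 0, 1]
lossy_judgments = [1, 0, 0, 1, 0, 1]

matrix = confusion_matrix(reference_judgments, lossy_judgments)
ratio = error_ratio(matrix)
```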

According to the single-precision binary recording in Table 2, increasing the error level (tolerance) increases misjudgment of the topology. This test provides a good example of the relationship between numerical precision and topology judgment errors. However, regarding the original purpose of achieving recording with topology conservation, the single-precision binary format was insufficient. Moreover, the double-precision data in Table 3 exhibited no error in topology judgment with an error level of 0.00001, whereas the single-precision data exhibited two errors. Consequently, we used the JHPCN-DF method with an error level of 0.00001 to develop the publicly available database of well-equilibrated initial configurations of ring-linear polymer blends.

We also investigated the influence of the size of linear chains (Nlinear) because an incorrect judgment is more likely for shorter linear chains due to the limitation of the topology judgment algorithm between a ring polymer and a linear chain13. Tables 4 and 5 present the Nlinear dependence of the error ratio of topology judgments. If the error ratio can be optimized for this problem, compression with an error level corresponding to Nlinear is justified.

Table 4 Single-precision binary recording: Nlinear-dependence of the error ratio of topology judgments.
Table 5 Double-precision binary recording: Nlinear-dependence of the error ratio of topology judgments.