Background & Summary

Over the last half century, collecting accurate, line-by-line spectroscopic data for isotopologues of the water molecule has been a major research activity in a large number of spectroscopic laboratories (see, for instance, Refs. 1,2,3,4,5,6,7,8,9,10 and references cited therein). An important contribution toward the detailed understanding of high-resolution spectra recorded for several water isotopologues, beyond selecting particular data for a particular database, started two decades ago, when a Task Group (TG) was set up by the International Union of Pure and Applied Chemistry (IUPAC) on “A Database of Water Transitions from Experiment and Theory” (Project No. 2004-035-1-100). This TG, formed by experimental and computational spectroscopists, published validated sets of measured rovibrational transitions and empirical energy levels on nine water isotopologues, \({{\rm{H}}}_{2}^{\,x}{\rm{O}}\)1,2,3, HDxO2, and \({{\rm{D}}}_{2}^{\,x}{\rm{O}}\)4 (x = 16, 17, 18).

A significant update of the IUPAC TG water data5 was published by four of the five authors of this paper in 2020, in the form of the W2020 database7,8, for the three \({{\rm{H}}}_{2}^{\,x}{\rm{O}}\) species. During the development of the W2020 datasets7,8, the spectroscopic data of the three isotopologues were considered jointly8, allowing improvements to be made for the individual datasets. The W2020-\({{\rm{H}}}_{2}^{\,x}{\rm{O}}\) line lists were successfully employed in the latest edition of HITRAN10, the canonical source of line-by-line spectroscopic information for species of atmospheric interest, representing about 85% of the ≈233 000 lines with complete assignment in the HITRAN-\({{\rm{H}}}_{2}^{\,x}{\rm{O}}\) catalogs.

Accurate, high-resolution spectroscopic information on various water isotopologues is required by numerous complex applications, including climate-change and atmospheric research, astronomy, combustion chemistry, metrology, planetary science, and remote sensing6,11,12,13, all with vastly different environments. The experimental studies of water spectra have been aided by the development of high-resolution and ultrahigh-precision techniques, such as cavity ring-down spectroscopy (CRDS)14 and noise-immune cavity-enhanced optical heterodyne molecular spectroscopy (NICE-OHMS)15,16,17,18,19,20,21,22. Theoretical interpretation of complex (ultra)high-resolution spectra also had to be improved. With the arrival of the fourth age of quantum chemistry23 came the ability to compute nearly complete line lists for molecules, like those constructed under the aegis of the ExoMol project24,25,26,27. Novel algorithms have also been devised which can cope with experimental data of vastly different accuracy23,28,29.

A notable theoretical advancement in high-resolution spectroscopy was the introduction of the concept of spectroscopic networks (SN)30,31,32,33,34. SNs form the basis of the MARVEL (Measured Active Rotational-Vibrational Energy Levels) procedure30,31,32,34,35,36,37, a global spectrum analysis tool7,33,34,38,39,40. MARVEL inverts the information contained in experimental line positions and delivers empirical energy levels41 with individual uncertainties. MARVEL has been used to study the spectra of several diatomic42,43,44,45,46,47, triatomic48,49,50,51,52, tetratomic53,54,55, and larger56,57 species.

A number of developments since the publication of the extensive W2020-\({{\rm{H}}}_{2}^{\,16}{\rm{O}}\) dataset have made its reexamination desirable. Most importantly, results from carefully designed precision-spectroscopy experiments have become available for four water isotopologues29,58,59,60,61,62,63. These studies, in particular, yielded empirical energies, accurate to a few kHz, for a large number of lower states in the experimental SN of \({{\rm{H}}}_{2}^{\,16}{\rm{O}}\). Further new experimental studies have also appeared64,65,66,67,68,69. In particular, experimentalists published a number of measured lines, with a typical uncertainty of 10−3–10−5 cm−164,66,68,70, challenging certain database entries of the W2020-\({{\rm{H}}}_{2}^{\,x}{\rm{O}}\) lists. Avoiding the criticism directed towards a subset of empirical W2020-\({{\rm{H}}}_{2}^{\,x}{\rm{O}}\) energy levels66, some already refuted in ref. 65, requires further improvements on how the experimental information is handled during a MARVEL-type analysis.

The research behind this paper focused on (a) the refinement of the W2020-\({{\rm{H}}}_{2}^{\,16}{\rm{O}}\) database via an improved MARVEL methodology, leading to the W2024 dataset, and (b) the construction of a large composite line list, called CW2024, for the \({{\rm{H}}}_{2}^{\,16}{\rm{O}}\) molecule. The (C)W2024 datasets are compared to HITRAN 202010, to expedite the inclusion of the (C)W2024 data in spectroscopic information systems.

Methods

Improved MARVEL methodology

During our studies devoted to MARVEL-based analyses of high-resolution rovibronic spectra of small, usually atmospherically and astronomically relevant molecules42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57, novel aspects and analysis tools have constantly been introduced. These became essential features in later versions of the MARVEL approach and they are described in a number of publications7,33,34,38,39,40. Nevertheless, the ever-expanding spectroscopic information available for \({{\rm{H}}}_{2}^{\,16}{\rm{O}}\) made it necessary to further improve our MARVEL-based analysis technique, as outlined below.

In what follows, the \({{\rm{H}}}_{2}^{\,16}{\rm{O}}\) energy levels are labelled as \(({v}_{1}\,{v}_{2}\,{v}_{3}){J}_{{K}_{a},{K}_{c}}\), whereby v1, v2, and v3 are the normal-mode quantum numbers of the symmetric stretch, bend, and antisymmetric stretch motions, respectively, J is the overall rotational quantum number, while Ka and Kc symbolize the conventional prolate- and oblate-top rotational quantum numbers, respectively. As usual, \(({v}_{1}^{{\prime} }\,{v}_{2}^{{\prime} }\,{v}_{3}^{{\prime} }){J}_{{K}_{a}^{{\prime} },{K}_{c}^{{\prime} }}^{{\prime} }\leftarrow ({v}_{1}^{{\prime\prime} }\,{v}_{2}^{{\prime\prime} }\,{v}_{3}^{{\prime\prime} }){J}_{{K}_{a}^{{\prime\prime} },{K}_{c}^{{\prime\prime} }}^{{\prime\prime} }\) denotes a rovibrational transition, where ′ and ″ signify the upper and lower states, respectively71.

Multiplet constraints

Under favorable circumstances, a spectral line representing a transition between two states is well separated from all neighboring lines, yielding a unique position for it. If two or more transitions are closer to each other than can be resolved by a particular experiment, the lines form an unresolved multiplet. In such cases, (a) the spectral line shape might become distorted, (b) the observed intensity corresponds to the sum of intensities of the individual transitions, and (c) the measured position will be an intensity-weighted average of the unknown individual positions. How the treatment of these multiplets was introduced to MARVEL is described next.

Let the wavenumber of the ith line in the dataset, σi, be represented with the following expression:

$${\sigma }_{i}\approx {S}_{i}\equiv \mathop{\sum }\limits_{j=1}^{{N}_{{\rm{T}}}}{u}_{ij}{s}_{j},$$
(1)

whereby NT is the number of transitions within the database, sj means the exact (unknown) position of the jth transition, and the uij entries are the relative weights satisfying \({u}_{i1}+{u}_{i2}+\ldots +{u}_{i{N}_{{\rm{T}}}}=1\) for each i. When the (ij) line pair is part of the same multiplet, uij will be the relative intensity of the jth line in this multiplet; otherwise, uij = 0. For example, if (1, 2) means an unresolved (ortho, para) doublet of \({{\rm{H}}}_{2}^{\,16}{\rm{O}}\), then σ1 = σ2 ≈ 0.75s1 + 0.25s2. Note that equation (1) holds for an isolated line as well, in which case uii = 1 and uij = 0 for all j ≠ i.

According to quantum mechanics, the sj wavenumber of a transition is subject to the Ritz principle72,

$${s}_{j}={E}_{{\rm{up}}(j)}-{E}_{{\rm{low}}(j)},$$
(2)

where up(j) and low(j) symbolize the indices of the upper and lower states of the jth line, respectively, and Ek is the (unknown) energy value of the kth quantum state within the transition dataset. Combining equation (1) and equation (2), the following least-squares objective function can be prescribed for MARVEL:

$${\Omega }({\bf{E}})=\mathop{\sum }\limits_{i=1}^{{N}_{{\rm{T}}}}{w}_{i}{\left[{\sigma }_{i}-{S}_{i}({\bf{E}})\right]}^{2},$$
(3)

whereby wi is the statistical (MARVEL) weight of the ith transition and E is the vector of unknown energy values (variables) in the Si(E) ≡ Si sums. If an \(\bar{{\bf{E}}}\) vector minimizes the (quadratic) objective function Ω(E), its entries are called empirical (MARVEL) energies [for details on how to calculate these MARVEL energies, see Supplementary Information (A)].
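To make equations (1)–(3) concrete, the following minimal Python sketch evaluates the objective function for a toy dataset containing one unresolved (ortho, para) doublet. The container classes, state labels, and all numerical values are hypothetical and chosen only for illustration; the sketch evaluates Ω(E) for a given energy vector and does not reproduce the minimization carried out by the MARVEL code.

```python
from dataclasses import dataclass

@dataclass
class Component:
    """One transition contributing to an observed line (illustrative container)."""
    u: float      # relative intensity u_ij within the multiplet
    upper: str    # label of the upper state
    lower: str    # label of the lower state

@dataclass
class ObservedLine:
    sigma: float        # measured wavenumber sigma_i, in cm^-1
    w: float            # statistical weight w_i
    components: list    # list of Component; a single entry for an isolated line

def model_position(line, energies):
    """S_i(E) of eq. (1), with each s_j given by the Ritz difference of eq. (2)."""
    return sum(c.u * (energies[c.upper] - energies[c.lower])
               for c in line.components)

def objective(lines, energies):
    """Weighted least-squares objective Omega(E) of eq. (3)."""
    return sum(l.w * (l.sigma - model_position(l, energies)) ** 2 for l in lines)

# Toy energies (cm^-1); the labels and values are made up for this example.
energies = {"low_o": 0.0, "low_p": 10.0, "up_o": 100.0, "up_p": 110.2}

# Unresolved doublet: both lines carry the same intensity-weighted position.
doublet = [Component(0.75, "up_o", "low_o"), Component(0.25, "up_p", "low_p")]
sigma = 0.75 * (100.0 - 0.0) + 0.25 * (110.2 - 10.0)
lines = [ObservedLine(sigma, 1.0, doublet), ObservedLine(sigma, 1.0, doublet)]

omega_exact = objective(lines, energies)

# Shifting one upper state by 0.04 cm^-1 moves S_i by 0.75 * 0.04 = 0.03 cm^-1.
shifted = dict(energies, up_o=energies["up_o"] + 0.04)
omega_shifted = objective(lines, shifted)
```

With the exact toy energies the objective vanishes (up to rounding), whereas the shifted energy vector is penalized through both members of the doublet.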

A drawback of applying multiplet constraints is that they reduce the number of statistical degrees of freedom, nDOF, in the database. In effective Hamiltonian (EH) fits, where such constraints are often employed, e.g., within the SPFIT code73, this is not a problem, as EH models contain far fewer fitting parameters than MARVEL; thus, they can tolerate a decreased nDOF value. In MARVEL, by contrast, the input dataset must be complemented with accurate estimates for the relative positions of the individual lines within unresolved multiplets to keep the equations solvable. A feasible way to find such estimates is proposed in the next subsection.

Use of computed energy-level splittings and relative positions

As evidenced multiple times, also for water isotopologues29,58,60,61,62, energy differences of rovibrational state pairs pertaining to the same vibrational band can be accurately derived from first-principles solution of the nuclear Schrödinger equation. This favorable state of affairs is due to the utilization of exact kinetic energy operators and the fact that discrepancies arising from deficiencies, such as local inaccuracies in the model potential energy surface (PES) employed, are largely systematic, leading to considerable error cancellation when energy differences between highly similar state pairs are formed. The same holds for the relative position of two lines sharing their upper and lower vibrational parents, as it can be obtained from the (signed) splittings of their upper and lower states, \({d}_{ij}^{{\prime} }\) and \({d}_{ij}^{{\prime\prime} }\), respectively:

$${\rho }_{ij}={s}_{i}-{s}_{j}=[{E}_{{\rm{up}}(i)}-{E}_{{\rm{low}}(i)}]-[{E}_{{\rm{up}}(j)}-{E}_{{\rm{low}}(j)}]=[{E}_{{\rm{up}}(i)}-{E}_{{\rm{up}}(j)}]-[{E}_{{\rm{low}}(i)}-{E}_{{\rm{low}}(j)}]={d}_{ij}^{{\prime} }-{d}_{ij}^{{\prime\prime} }.$$
(4)

Thus, computed energy-level splittings, which can be added as wavenumbers of “virtual” lines to the input file, are able to eliminate the underdeterminacy induced by multiplet constraints. Note that resonance interactions among closely-spaced levels in the same J/symmetry block may decrease the accuracy of these computed splittings at high Js, which must be accounted for in the final uncertainty budget.
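The error cancellation behind equation (4) can be illustrated with a short Python sketch. The function name and the numerical values below are hypothetical; the example only shows that an offset common to both upper states (or to both lower states), mimicking a systematic band-origin error of a computed energy list, drops out of the relative position.

```python
def relative_position(e_up_i, e_low_i, e_up_j, e_low_j):
    """rho_ij of eq. (4): difference of upper- and lower-state splittings."""
    d_upper = e_up_i - e_up_j     # signed splitting of the upper states
    d_lower = e_low_i - e_low_j   # signed splitting of the lower states
    return d_upper - d_lower

# Made-up "exact" energies (cm^-1) for two lines sharing their vibrational parents.
e_up_i, e_low_i = 3801.420, 42.372
e_up_j, e_low_j = 3801.395, 42.360

rho_exact = relative_position(e_up_i, e_low_i, e_up_j, e_low_j)

# A computed energy list with a systematic +0.08 cm^-1 error on the upper band
# and a -0.01 cm^-1 error on the lower band gives the same relative position.
rho_computed = relative_position(e_up_i + 0.08, e_low_i - 0.01,
                                 e_up_j + 0.08, e_low_j - 0.01)
```

In this toy case the relative position is 0.013 cm−1 for both energy sets, even though the individual computed line positions are off by 0.09 cm−1.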

Within the W2024 dataset, the virtual transitions defined above are placed into a segment called “24virt” and correspond only to energy splittings of ortho-para state pairs, whose assignments differ solely in their Ka or Kc quantum numbers (the error cancellation seems to work exceptionally well for these state pairs). The splitting values included in the 24virt segment are taken from the first-principles POKAZATEL74 energy list, for which

$${\mathcal{U}}({d}^{{\rm{POK}}})\approx \max \left[| {d}^{{\rm{POK}}}-{d}^{{\rm{BT2}}}| ,\min \left(0.1\,| {d}^{{\rm{POK}}}| ,\,0.025\,{{\rm{cm}}}^{-1}\right),\,1{0}^{-6}\,{{\rm{cm}}}^{-1}\right]$$
(5)

is employed as an (initial) uncertainty approximation, whereby dPOK and dBT2 are the POKAZATEL74 and BT275 estimates for the same splitting, respectively. A description of how uncertainties of relative positions are taken into account in the uncertainties of energies and predicted wavenumbers is offered in Supplementary Information (B).
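Equation (5) translates directly into a few lines of Python. The function and argument names below are hypothetical; only the formula itself is taken from the text, with all quantities in cm−1.

```python
def splitting_uncertainty(d_pok, d_bt2):
    """Initial uncertainty of a POKAZATEL splitting according to eq. (5).

    d_pok, d_bt2: POKAZATEL and BT2 estimates of the same splitting, in cm^-1.
    """
    discrepancy = abs(d_pok - d_bt2)               # |d_POK - d_BT2|
    relative_cap = min(0.1 * abs(d_pok), 0.025)    # 10% of the splitting, capped
    floor = 1e-6                                   # smallest allowed uncertainty
    return max(discrepancy, relative_cap, floor)

# Illustrative (made-up) splittings covering the three branches of the formula.
u_small = splitting_uncertainty(0.001, 0.00105)   # 10% term dominates
u_large = splitting_uncertainty(1.0, 1.0)         # capped at 0.025 cm^-1
u_zero = splitting_uncertainty(0.0, 0.0)          # floor of 1e-6 cm^-1
u_disc = splitting_uncertainty(0.5, 0.6)          # discrepancy dominates
```

The three terms act as a discrepancy-driven estimate, a relative estimate capped at 0.025 cm−1, and an absolute floor of 10−6 cm−1, respectively.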

Confidence intervals

To appreciate the importance of confidence intervals, a new concept introduced here to MARVEL, one needs to understand the limitations of network-based procedures for the recognition of outliers. During the analysis of SNs, outliers are lines with faulty wavenumbers, uncertainties, or assignments. As shown before40, outlier-detection tools designed for SNs are built upon the notion of network cycles (that is, sequences of connected lines and states, where each state has exactly two neighboring states) and network (in)consistency. It must also be stressed that there are a few misconceptions surrounding outlier detection in high-resolution spectroscopy40. One of them is related to latent outliers, which cannot be detected via network-theoretical means, as they do not violate the consistency of the SN.

Owing to potential error compensation in cycles, see misconception M5 in Ref. 40, in principle any transition might be a latent outlier. In practice, a latent outlier is typically (a) a bridge (i.e., a line without cycles) or (b) a transition whose uncertainty is smaller than the threshold (that is, the sum of uncertainties) in all of its cycles. For instance, if a line has an uncertainty of 10−4 cm−1, but it participates only in cycles with thresholds being 10−3 cm−1, the accuracy of this transition can be validated by MARVEL only to 10−3 cm−1.

Based on all these considerations, it is worth defining a measure of “validity”, called here a confidence interval (CI), characterizing each transition of the dataset. A CI value provides a lower limit, below which no error can be recognized by MARVEL in a line position or its uncertainty. The emphasis is on the lower-limit property of CI, because there is no upper limit for the magnitude of hidden errors (again, due to possible error cancellation). Intuitively, the CI of a line can be defined as the accuracy of its most accurate, non-trivial cycle (a trivial cycle has only two transitions with the same assignment). This specification leaves CI undefined for a line which is not part of any non-trivial cycle. Actually, for such transitions it is not meaningful to speak of a MARVEL-based validation. For a formal definition of the CI parameter and its extension to energy levels, see Supplementary Information (C).
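The intuitive definition of the CI can be sketched in Python as follows. This is only an illustration under stated assumptions: the cycles are supplied explicitly as lists of line tags (enumerating cycles in a real spectroscopic network is a separate and costly task), the threshold of a cycle is taken as the sum of its line uncertainties, and all tags, assignments, and numbers are made up. The formal definition in Supplementary Information (C) may differ in detail.

```python
def cycle_threshold(cycle, uncertainty):
    """Threshold of a cycle: the sum of the uncertainties of its lines."""
    return sum(uncertainty[tag] for tag in cycle)

def is_trivial(cycle, assignment):
    """A trivial cycle consists of two transitions with the same assignment."""
    return len(cycle) == 2 and assignment[cycle[0]] == assignment[cycle[1]]

def confidence_interval(tag, cycles, uncertainty, assignment):
    """Smallest threshold among the non-trivial cycles containing the line.

    Returns None for a line that is not part of any non-trivial cycle.
    """
    thresholds = [cycle_threshold(c, uncertainty)
                  for c in cycles
                  if tag in c and not is_trivial(c, assignment)]
    return min(thresholds) if thresholds else None

# Hypothetical line tags with uncertainties (cm^-1) and assignments.
uncertainty = {"a.1": 1e-4, "a.2": 2e-4, "b.1": 5e-4, "c.1": 4e-4, "d.1": 1e-3}
assignment = {"a.1": "X<-Y", "a.2": "X<-Y", "b.1": "X<-Z", "c.1": "Y<-Z",
              "d.1": "W<-X"}
cycles = [["a.1", "a.2"],           # trivial: same assignment measured twice
          ["a.1", "b.1", "c.1"]]    # non-trivial three-membered cycle

ci_a1 = confidence_interval("a.1", cycles, uncertainty, assignment)
ci_d1 = confidence_interval("d.1", cycles, uncertainty, assignment)
```

Here the line "a.1", despite its uncertainty of 10−4 cm−1, receives a CI of 10−3 cm−1 from its only non-trivial cycle, while the bridge "d.1" has no CI at all.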

Data sources and their treatment

There are only a limited number of data sources60,62,63,64,66,69,76,77,78,79,80,81 which are available today but were not handled during the construction of the W2020-\({{\rm{H}}}_{2}^{\,16}{\rm{O}}\) dataset. Apart from five publications76,77,78,79,80, these sources were published after 2020. In the W2020 input file, 93GuRa82 was mistakenly referred to as ‘86GuRa’; this tag should refer to one of the new sources, Ref. 78. Seven W2020 sources, 67HaDo83, 73PuRa84, 09GrBoRiMa85, 12Boyarkin86, 20virt7, 20extra7, and 20compl7, have been fully removed from the present analysis. Of these sources, 09GrBoRiMa85 and 12Boyarkin86, which utilize multiphoton techniques to probe high-lying states of \({{\rm{H}}}_{2}^{\,16}{\rm{O}}\), may well be included in a future update of W2024, when more accurate first-principles energies become available above 30 000 cm−1, allowing a reliable validation of the 09GrBoRiMa85 and 12Boyarkin86 lines.

Table 1 contains segments constructed from the new sources considered during this study. While certain sources not divided up in W2020 into segments were divided into multiple segments in W2024, for the sake of simplicity these segments are not specified in Table 1. Furthermore, the 24virt segment, which substitutes its predecessors in W2020, 20virt, 20virt_S2, 20virt_S3, and 20virt_S4, is not given in Table 1 either. The 14 transitions which had to be deleted from the new sources are listed in Table 2.

Table 1 List of segments involving data sources new to W2024 compared to W2020-\({{\rm{H}}}_{2}^{\,16}{\rm{O}}\).
Table 2 Complete list of experimental lines deleted from the new sources given in Table 1.

During the construction of the W2024 database, it became necessary to add short comments to lines which, in certain aspects, must be distinguished from other transitions of the input dataset. Accordingly, the standard format of the line tags1, which consists of the segment name and a serial number, has been extended in this study with so-called markers. The principal markers used in the W2024 input file are listed in Supplementary Information (D).

As to the MARVEL treatment of the input transitions dataset, it is worth emphasizing a few important aspects. First, an ortho-para doublet of a segment, observed under Doppler-limited conditions, was deemed to be unresolved if the separation of the reported experimental positions was smaller than one third of the associated Doppler half width at the actual measurement temperature. Second, when the ortho/para complement of a para/ortho line was not published, then it was added to the W2024 input [see also Supplementary Information (D)]. Third, lines within unresolved multiplets other than ortho-para doublets were not subject to multiplet constraints, as their relative positions are usually not known accurately from first-principles computations; their confidence intervals have been increased to reflect their potential inaccuracy. Fourth, a set of empirical positions and energy levels, taken from the literature64,65,66,68,70,80, was used during the refinement of the wavenumber uncertainties, to reach better agreement, wherever possible, with these auxiliary data. Upon termination of the refinement process, MARVEL was re-executed after eliminating all but 11 lines of this auxiliary dataset from the final W2024 input. The 11 empirical transitions preserved come from the source 20MiKaMoCa80, see Table 1, which seems to rely partially on accurate unpublished experimental lines.
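The first criterion can be illustrated with a small Python sketch. It assumes that the "Doppler half width" is the half width at half maximum (HWHM) of the Gaussian Doppler profile, evaluated with the standard formula at the mean position of the doublet; the function names and the example positions are hypothetical.

```python
import math

K_B = 1.380649e-23         # Boltzmann constant, J/K (exact SI value)
C = 299_792_458.0          # speed of light, m/s (exact SI value)
AMU = 1.66053906660e-27    # atomic mass constant, kg
MASS_H2_16O_U = 18.010565  # approximate mass of H2(16)O, in u

def doppler_hwhm(wavenumber, temperature, mass_u=MASS_H2_16O_U):
    """Doppler HWHM (cm^-1) of a line at `wavenumber` (cm^-1) and `temperature` (K)."""
    ratio = 2.0 * math.log(2.0) * K_B * temperature / (mass_u * AMU * C ** 2)
    return wavenumber * math.sqrt(ratio)

def is_unresolved_doublet(pos_a, pos_b, temperature, mass_u=MASS_H2_16O_U):
    """True if the separation is below one third of the Doppler HWHM."""
    centre = 0.5 * (pos_a + pos_b)
    return abs(pos_a - pos_b) < doppler_hwhm(centre, temperature, mass_u) / 3.0

# Illustrative check near 10 000 cm^-1 at 296 K (HWHM of roughly 0.0145 cm^-1).
hwhm = doppler_hwhm(10000.0, 296.0)
close_pair = is_unresolved_doublet(10000.000, 10000.003, 296.0)
wide_pair = is_unresolved_doublet(10000.000, 10000.006, 296.0)
```

Under these assumptions, a reported separation of 0.003 cm−1 would be classified as unresolved, whereas a separation of 0.006 cm−1 would not.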

Data Records

The W2024 database is available in an OSF (Open Science Framework) repository87, which contains validated transitions, empirical energy levels, and an extensive line list for the \({{\rm{H}}}_{2}^{\,16}{\rm{O}}\) isotopologue. In the rest of this section, a brief summary is provided about the nine files located in the file “W2024.zip” within the W2024 repository.

As customary, the W2024 repository is accompanied by a “README.txt” file, including a concise description of the content of the other files. In “README.txt”, the file names are arranged in the order of their importance.

The entire collection of the 212 segments created from the 189 sources29,41,60,62,63,64,65,66,67,68,69,74,76,77,78,79,80,81,82,86,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255,256 are presented in “W2024_segment_table.pdf”, where a couple of important statistical parameters are given for each segment, in a form similar to that of Table 1 (the difference is only that R is missing from “W2024_segment_table.pdf”).

The file “W2024_segments.txt” is the segment input file for the MARVEL code, where the units of the line positions and their uncertainties are specified for each segment. The file “W2024_transitions.txt” contains the 309 290 input transitions collected for the MARVEL procedure. In this file, each input transition is associated with (a) a line position, (b) an initial and an adjusted line-position uncertainty, (c) a \(({v}_{1}^{{\prime} }\,{v}_{2}^{{\prime} }\,{v}_{3}^{{\prime} }){J}_{{K}_{a}^{{\prime} },{K}_{c}^{{\prime} }}^{{\prime} }\leftarrow ({v}_{1}^{{\prime\prime} }\,{v}_{2}^{{\prime\prime} }\,{v}_{3}^{{\prime\prime} }){J}_{{K}_{a}^{{\prime\prime} },{K}_{c}^{{\prime\prime} }}^{{\prime\prime} }\) rovibrational assignment, and (d) a line tag representing a unique identifier.

The empirical energy values, obtained for 19 027 rovibrational states in the 0–26 268 cm−1 range, are placed in the file “W2024_energy_levels.txt”. Each state of this data file is supplied with (a) a \(({v}_{1}\,{v}_{2}\,{v}_{3}){J}_{{K}_{a},{K}_{c}}\) label, (b) an empirical (MARVEL) energy, (c) an energy uncertainty followed by a (relative) confidence interval in parentheses, (d) the number of transitions incident to this state, and (e) the index of the respective POKAZATEL74 state.

The file “W2024-24MiVaCa_comparison.xls” lists 57 states, for which the W2024 and the 24MiVaCa70 energies deviate by more than 0.005 cm−1 or their assignments are different. For each line a short comment is given indicating a potential reason for the discrepancy.

Using empirical (W2024) and first-principles (POKAZATEL74) energies, a composite line list, named CW2024, was constructed, forming part of the file “CW2024_line_list.txt”. This line list consists of more than 490 000 dipole-allowed transitions in the 0–41 200 cm−1 range, with room-temperature intensities down to 10−31 cm molecule−1. For almost half of the CW2024 entries, that is for about 231 000 lines in the 0.07–25681.5 cm−1 region, empirical positions are reported; all of them are augmented with individual wavenumber uncertainties and (relative) confidence intervals. For all of the CW2024 lines, the intensities are taken from the POKAZATEL line list, complemented with their BT275 counterparts, whenever applicable. For the empirical transitions of this list, essential walks are also provided in “CW2024_walk_file.txt”. These walks help to understand how the empirical positions and their uncertainties can be approximately extracted from a handful of W2024 input lines [for details on the use of walks, see Supplementary Information (B)]. A line-by-line comparison between HITRAN 2020 and the (C)W2024 dataset is presented in the file “HITRAN_comparison.txt”, which will be discussed in the “Technical Validation” part of this paper.

Finally, the “MARVEL.zip” file contains a developer version of the MARVEL code, written in the C++ language. This version of the MARVEL code, distributed with the necessary input files, was used to generate the numerical data in the TXT files of the W2024 repository (except the input data listed in “W2024_transitions.txt”). The novel MARVEL features, implemented in this code version and described in the “Methods” section, will form part of the MARVELOnline web application (http://kkrk.chem.elte.hu/marvelonline/) in the future.

Technical Validation

Validation of the W2024 energy levels

The principal validation of the W2024 energy levels was performed via MARVEL, by checking the consistency of the input transitions in relation to their assignments, wavenumbers, and uncertainties. This process resulted in a self-consistent energy-level dataset with individual uncertainties and confidence intervals.

The W2024 energy levels have been matched with their first-principles (BT275, POKAZATEL74, and VoTe257) counterparts, making use of the |EW2024 − Ecomp| ≤ 10−4Ecomp criterion, where EW2024 and Ecomp denote empirical and computed energy values, respectively. Despite previous efforts74,75,257,258,259, no unambiguous labelling scheme exists for water isotopologues and, indeed, it is unlikely that such a scheme could be developed259,260. Owing to a number of notable differences in rovibrational assignments across the three datasets, only the J/symmetry labels were used during the formation of the (EW2024, Ecomp) pairs.
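A minimal Python sketch of such a matching step is given below. Only the relative criterion and the restriction to identical J/symmetry labels come from the text; the tuple layout, the symmetry labels, the energy values, and the choice of the nearest candidate when several computed levels satisfy the criterion are assumptions made for illustration.

```python
from collections import defaultdict

def match_levels(empirical, computed, rel_tol=1e-4):
    """Pair empirical levels with computed ones inside the same J/symmetry block.

    Both inputs are lists of (J, symmetry, energy) tuples, energies in cm^-1.
    Returns {empirical index: computed index or None}.
    """
    blocks = defaultdict(list)
    for idx, (j, sym, e_comp) in enumerate(computed):
        blocks[(j, sym)].append((idx, e_comp))

    matches = {}
    for k, (j, sym, e_emp) in enumerate(empirical):
        # Keep candidates fulfilling |E_emp - E_comp| <= rel_tol * E_comp.
        candidates = [(abs(e_emp - e_comp), idx)
                      for idx, e_comp in blocks.get((j, sym), [])
                      if abs(e_emp - e_comp) <= rel_tol * e_comp]
        # Assumption: take the closest candidate if there are several.
        matches[k] = min(candidates)[1] if candidates else None
    return matches

# Made-up levels: (J, symmetry label, energy in cm^-1).
empirical = [(1, "B1", 23.7944), (1, "A1", 5000.0)]
computed = [(1, "B1", 23.7950), (1, "B1", 42.37), (1, "A1", 5001.0)]

pairs = match_levels(empirical, computed)
```

In this toy example the first empirical level is paired with the first computed level, while the second one remains unmatched, because a deviation of 1 cm−1 exceeds 10−4 times the computed energy.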

Consistency of the W2024 energy levels was also checked via the pair identity and smooth variation rules of Ma et al.261. For each vibrational state and J, a plot was made of the energy versus the Ka quantum number. These plots were studied to ensure their correct pairing structures and smooth variations. Everything checked out correctly, giving further confidence in the correctness of the W2024 assignments and empirical rovibrational energies.

Compared to W2020-\({{\rm{H}}}_{2}^{\,16}{\rm{O}}\), the W2024-\({{\rm{H}}}_{2}^{\,16}{\rm{O}}\) dataset deals with only a small number of new data sources. Nevertheless, it contains more than 500 new empirical rovibrational energy levels. How each new source contributes to the set of new energy levels is given in Table 3. Not too surprisingly, the largest contributor is the source 24virt, yielding empirical energies for the ortho/para complements of over 200 para/ortho state pairs. Note also that a few additional energy levels, not reflected in the numbers given in Table 3, were also obtained from the set of more than 1000 transitions reassigned during this study. Consideration of new sources has particular relevance when they provide new energy levels or help to determine improved empirical energy values and/or uncertainties for states already available. For the latter case, the highly accurate sources listed in the first few rows of Table 1 proved to be particularly useful.

Table 3 Contribution of the new sources to the set of new W2024 energy levels.

A detailed comparison of the W2024 energy levels with their W2020 counterparts reveals occasional significant shifts, displayed in Table 4, in previously known energy values. As shown there, not only the less accurate emission sources, like 08ZoShOvPo204 and 05CoBeCaCo180, but some of the more dependable absorption sources, namely 08ToTe203, 08CaMiLi200, 11BeMiCa208, and 14ReOuMiWa225, produced a few unreliable energies, as well. All in all, there are only about 500 cases where the W2024 – W2020 deviations fall outside of the W2020/W2024 uncertainties.

Table 4 Ten selected W2024 energy levels deviating by more than 0.1 cm−1 from their W2020 counterparts.

To provide a comprehensive picture about the collection of W2024 energy levels, their distributions are plotted against the rovibrational energies and their uncertainties in Fig. 1. For reference purposes, the energy distribution of the first-principles POKAZATEL74 states used, forming a complete set in the 0–26 268 cm−1 range investigated, is also given. As is obvious from Fig. 1, (a) all states are known in W2024 up to 8995 cm−1, (b) the number of missing empirical states increases rapidly as the energy increases, (c) for a significant number of states the energies are known with an accuracy better than 10−6 cm−1, and (d) a few W2024 states, deduced from some less accurate 24virt lines, have relatively large, 0.1 – 0.7 cm−1, uncertainties (such states could be targeted by future measurements).

Fig. 1

Distribution of the W2024 and POKAZATEL states along the rovibrational energies and their uncertainties. The lower panel gives the distribution of the energy values for the W2024 and the POKAZATEL datasets. The upper panel provides the distribution of W2024 states by uncertainties. The range represented by a bin is given by the actual and the previous axis ticks (e.g., the blue bin at 10−4 cm−1 contains empirical states with an uncertainty of 10−5–10−4 cm−1). The state counts, that is the bin sizes, are plotted on a unified bi-directed vertical axis for both distributions. Note the logarithmic scale on both parts of the vertical axis.

Comparison with HITRAN 2020

Comparing the (C)W2024 and the HITRAN 2020-\({{\rm{H}}}_{2}^{\,16}{\rm{O}}\) line catalogs is particularly important, as it allows additional validation of the W2024 database; furthermore, it might reveal HITRAN entries which require further verification/modification. Results of this comparison are discussed next, without reliance on BT2 intensities.

To facilitate the comparison of the HITRAN 2020-\({{\rm{H}}}_{2}^{\,16}{\rm{O}}\) line list with the (C)W2024 dataset, an attempt was made to set up a simple quality-assessment scheme for six HITRAN data types which are also present in CW2024. A six-character quality sequence, Q1-Q2-Q3-Q4-Q5-Q6, has been introduced, where Qp symbolizes the pth quality indicator (QI). Intuitive definitions for the four possible values, A–D, of the six QIs are included in Table 5. Briefly, (i) “A” is the best category, (ii) “B” means acceptable, given the present knowledge, (iii) “C” indicates a conflict between (C)W2024 and HITRAN, which is most probably due to the incorrectness of the HITRAN entry, and (iv) “D” means that no verification was possible for a data type. Table 5 also lists four “comments” attached by us to a few peculiar HITRAN 2020 lines. For further details on the QI values, see Supplementary Information (E).

Table 5 Quality indicators and comment categories used to characterize the HITRAN-\({{\rm{H}}}_{2}^{\,16}{\rm{O}}\) lines.

During this comparison, more than 2500 “W2020” transitions have been identified in HITRAN 2020 for which either the position or the rovibrational assignment differs significantly from that contained in the W2020 database8. Table 6 gives six characteristic examples for such questionable HITRAN 2020 transitions. For these lines, certain parameter values were seemingly incorrectly transcribed from W2020.

Table 6 Typical examples for problematic transitions in the HITRAN 2020-\({{\rm{H}}}_{2}^{\,16}{\rm{O}}\) list.

Figure 2 depicts the distribution of the 13 most common quality sequences, corresponding to 90% of the HITRAN 2020-\({{\rm{H}}}_{2}^{\,16}{\rm{O}}\) lines. The good news is that the leading sequence is “6A”, see the dark green slice in Fig. 2, where all the six data types of Table 5 are corroborated by (C)W2024. Nevertheless, there is a considerable number of transitions which require particular attention, and may lead, after additional validation, to corrections of certain HITRAN 2020 entries. Lines falling into the “gray zone”, with a sequence “6D”, for which none of the six HITRAN parameters could be affirmed by (C)W2024, must be investigated carefully. Note in this respect that POKAZATEL intensities are highly accurate in the infrared, but get increasingly inaccurate as one moves toward visible wavelengths262,263. For the full list of quality sequences attached to the HITRAN 2020-\({{\rm{H}}}_{2}^{\,16}{\rm{O}}\) lines, see ref. 87.
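A tally of the kind underlying such a distribution can be sketched in Python. The input format is an assumption: each HITRAN line is represented here by a plain six-character string of quality indicators, and the sample data are made up; the actual quality sequences are those listed in the repository of ref. 87.

```python
from collections import Counter

def most_common_sequences(sequences, coverage=0.9):
    """Most frequent quality sequences that together reach the given coverage.

    `sequences` holds one six-character string (letters A-D) per line.
    Returns a list of (sequence, count) pairs in decreasing order of count.
    """
    counts = Counter(sequences)
    total = sum(counts.values())
    selected, cumulative = [], 0
    for seq, n in counts.most_common():
        if cumulative >= coverage * total:
            break
        selected.append((seq, n))
        cumulative += n
    return selected

# Made-up sample: six fully corroborated lines, three unverifiable ones,
# and one line with a single "B" indicator.
sample = ["AAAAAA"] * 6 + ["DDDDDD"] * 3 + ["ABAAAA"]
top = most_common_sequences(sample, coverage=0.9)
```

For this sample, the two leading sequences already cover 90% of the lines, so the third sequence is left out of the selection.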

Fig. 2

Distribution of the most frequent quality sequences, covering 90 % of the HITRAN 2020-\({{\rm{H}}}_{2}^{\,16}{\rm{O}}\) lines. The color codes applied for the quality sequences are shown on the left-hand side of this figure, where slices formed by empirical, computed, or mixed (empirical plus computed) HITRAN 2020 transitions are clearly distinguished. The blue arrow indicates the direction whereby the slices follow the ordering utilized in the color legend. To highlight its most important characteristics, each slice is supplied with short stamps, displayed on the pie charts (notice that there are stamps shared between two slices). The relative fractions of the individual slices, with respect to the complete HITRAN 2020-\({{\rm{H}}}_{2}^{\,16}{\rm{O}}\) database, are included in colored boxes. Those lines which could not be matched with the (C)W2024 line list appear in the gray slice. The remaining transitions, whose quality sequences differ from those exhibited in the color legend, are collected in the cyan slice. For improved transparency, slices with less than 10 000 lines are enlarged in an inset.

Usage Notes

The empirical energy levels derived during this study are associated with individual uncertainties and confidence intervals, giving numerical characterization of the trust we have in the W2024 energies, as well as in the predicted transition wavenumbers. These important statistical parameters must be taken into account in applications using the (C)W2024 datasets. The set of new and corrected empirical energies of this study could, for example, prove useful for adjusting existing potential energy surfaces of the \({{\rm{H}}}_{2}^{\,16}{\rm{O}}\) molecule, reducing the discrepancies between the results of variational nuclear-motion computations and experiment.

The CW2024 database could be helpful for experimental spectroscopists, who wish to (re)analyze their new and old spectra, especially when looking for new energy levels absent from the W2024 energy list. This CW2024 catalog would also provide support for the validation and occasional correction of \({{\rm{H}}}_{2}^{\,16}{\rm{O}}\) transitions present in line-by-line spectroscopic databases.

Updated versions of the database files will be made available at the website https://respecth.elte.hu/. Version history will be provided in a file called “NOTES.txt” under the W2024 repository87.