Hierarchy-boosted funnel learning for identifying semiconductors with ultralow lattice thermal conductivity

Wu, Mengfan; Yan, Shenshen; Ren, Jie

doi:10.1038/s41524-025-01583-9

Download PDF

Article
Open access
Published: 22 April 2025

Hierarchy-boosted funnel learning for identifying semiconductors with ultralow lattice thermal conductivity

Mengfan Wu^1,2,
Shenshen Yan^1,2 &
Jie Ren^1,2

npj Computational Materials volume 11, Article number: 106 (2025) Cite this article

3079 Accesses
Metrics details

Subjects

Abstract

Data-driven machine learning (ML) has demonstrated tremendous potential in material property predictions. However, the scarcity of materials data with costly property labels in the vast chemical space presents a significant challenge for ML in efficiently predicting properties and uncovering structure-property relationships. Here, we propose a novel hierarchy-boosted funnel learning (HiBoFL) framework, which is successfully applied to identify semiconductors with ultralow lattice thermal conductivity (κ_L). By training on only a few hundred materials targeted by unsupervised learning from a pool of hundreds of thousands, we achieve efficient and interpretable supervised predictions of ultralow κ_L, thereby circumventing large-scale brute-force ab initio calculations without clear objectives. As a result, we provide a list of candidates with ultralow κ_L for potential thermoelectric applications and discover a new factor that significantly influences structural anharmonicity. This HiBoFL framework offers a novel practical pathway for accelerating the discovery of functional materials.

Generative deep learning for predicting ultrahigh lattice thermal conductivity materials

Article Open access 11 April 2025

Unlocking phonon properties of a large and diverse set of cubic crystals by indirect bottom-up machine learning approach

Article Open access 15 August 2023

Prediction of thermoelectric performance for layered IV-V-VI semiconductors by high-throughput ab initio calculations and machine learning

Article Open access 29 October 2021

Introduction

Emerging as a powerful technology of the data-driven paradigm in materials science, machine learning (ML) has considerably accelerated the design and discovery of promising materials in recent years^1,2,3,4, including ML interatomic potentials⁵, inverse design of materials⁶, efficient property predictions^7,8,9. Simultaneously, the advancements in high-performance computing greatly facilitate the establishment of diverse material-related databases utilizing density functional theory (DFT)-based high-throughput calculations (HTC), such as the Materials Project (MP)¹⁰, Open Quantum Materials Database (OQMD)^11,12, Automatic-FLOW for materials discovery (AFLOW)¹³, Joint Automated Repository for Various Integrated Simulations (JARVIS)¹⁴, etc. These continuously expanding databases also lay a solid foundation for the application of cutting-edge ML technology in the field of materials science.

As the two main categories of ML, namely supervised and unsupervised learning strategies, both have achieved remarkable success in different ways. On the one hand, supervised learning enables the efficient predictions of material properties by-passing solving the complex equations of quantum mechanics based on expensive DFT calculations, which emphasizes the requirement of large human-labeled datasets for model training to ensure accuracy. Ridge regression¹⁵, decision tree¹⁶, support vector machine¹⁷, random forest¹⁸, gradient boosting decision tree¹⁹, etc., are widely employed in designing and screening potential materials with desired properties, which also elucidate the close relationships between the structure and target property²⁰. On the other hand, as a technology operates without the necessity of well-labeled training data, unsupervised learning possesses the capability to infer the underlying patterns among varieties of materials within a feature space. Predominantly employed methods in unsupervised learning encompass clustering, dimensionality reduction, and anomaly detection, in which clustering can categorize different materials into the corresponding clusters by assessing their similarities between each other, thereby identifying candidates resembling the anticipated data points^21,22,23. In this context, combining HTC and ML technologies can not only efficiently explore novel materials in the vast chemical space but also gain insights into the structure-property relationships at a quantitative level. However, a significant challenge lies in labeling data for materials with intrinsically complex properties in the vast chemical space, especially the lattice thermal conductivity (κ_L), due to the complexities involved in experimental measurements and the unbearable computational costs associated with accurate DFT calculations. Is there an effective approach to reduce the cost of labeling, while enabling the efficient prediction of complex properties and elucidation of structure-property relationships?

Functional materials exhibiting ultralow κ_L possess vital significance across various fields, such as power generation²⁴, heat conduction²⁵, thermal barrier coatings²⁶ and so on^27,28,29, which greatly advance the development of industry. Particularly, owing to the key role in directly converting heat energy into electricity based on the thermoelectric (TE) effect, TE materials have become the focal point of considerable interest in academic and industrial research. Indeed, the conversion efficiency of a TE material is theoretically quantified by its figure of merit zT: zT = S²σT/κ, where S is the Seebeck coefficient, σ is the electrical conductivity, κ is the thermal conductivity, and T is the absolute temperature, respectively. Moreover, κ can be expressed in two parts: κ = κ_e + κ_L, where κ_e and κ_L are the electronic thermal conductivity and lattice thermal conductivity, respectively, indicating that they all contribute to heat conduction. For metallic materials, κ_e plays a dominant role in heat conduction due to the presence of a large number of free electrons. On the contrary, in semiconductors or insulators, thermal energy is predominantly transferred through lattice vibrations, with κ_L making the primary contribution. In this process, the quanta of such lattice vibrations in a solid are the so-called phonons.

As clearly noticed, toward the goal of seeking TE materials with optimal zT values, it requires not only maximizing the power factor (S²σ) but also minimizing the thermal conductivity (κ_e + κ_L) simultaneously³⁰. Notably, the distinct and separate scale of the mean free paths for electrons and phonons contributes to the independence of κ_L as a parameter in zT, which makes decreasing κ_L become a significant avenue to realize the ideal concept of “phonon-glass and electron-crystal”^31,32, ultimately leading to excellent TE performance. For the past decades, unremitting efforts have been made to discover a series of materials with κ_L, for instance, Tl₃VSe₄³³, TlInTe₂³⁴, InTe³⁵, CsAg₅Te₃³⁶, AgSbSe₂³⁷, etc. However, the conversion efficiency of TE materials has persistently been a significant challenge in the efficient recovery of waste heat, that is to say, useful materials featuring ultralow κ_L are still in urgent demand. Despite the few successes of traditional trial-and-error experiments and case-by-case DFT calculations in the exploration of desired materials, efficient and robust material design oriented towards κ_L in the vast chemical space is hindered by the complex structure–property relationships and size-limited resources.

In the present work, we propose a novel hierarchy-boosted funnel learning (HiBoFL) framework that integrates unsupervised learning and supervised learning to efficiently achieve the predictions of costly-labeled complex properties, which is successfully applied to identify semiconductors with ultralow κ_L. Unsupervised learning is used to uncover underlying patterns among different materials, facilitating the identification of problem-specific clusters with a high likelihood of exhibiting ultralow κ_L. Based the low-cost HTC on this significantly reduced space of specific clusters, we establish a local database and discover a series of semiconductors with ultralow κ_L, in which Cs₂SnSe₃ and Cs₂GeSe₃ are screened out for in-depth mechanism analysis. Furthermore, a supervised classification model for directly predicting ultralow κ_L is trained to refine the coarse result of unsupervised learning. With resolved important descriptors that govern ultralow κ_L, we are capable of investigating the κ_L modulation mechanisms and uncovering a new factor that governs structural anharmonicity. We expect that this HiBoFL framework can also be widely applied in the discovery of other functional materials with excellent performances.

Results

HiBoFL framework for accelerating the discovery of functional materials

Our proposed novel HiBoFL framework for accelerating the discovery of functional materials with complex properties is shown in Fig. 1a, which exhibits a funnel-like structure driven by a hierarchical framework, effectively narrowing the search space while boosting model performances. This framework mainly includes four parts: (I) Data preparation. An initial theoretical or experimental dataset of the target material system is required, potentially involving preliminary high-throughput screening, data cleaning, and other preprocessing operations. (II) Unsupervised learning. Relevant features are extracted from the initial dataset, encompassing aspects such as experimental process parameters, intrinsic material attributes, etc. By employing clustering algorithms, distinct classes of data points with potentially similar properties can be identified, in which problem-specific clusters are selected to narrow the search space. (III) Data annotation. Data from the problem-specific clusters can be assigned corresponding property labels through further relatively low-cost experimental characterizations or HTC. This facilitates the establishment of a local database, enabling the extraction of prior domain knowledge through statistical data analysis. (IV) Supervised learning. The labeled dataset within the established local database can be used to further train supervised learning models, which helps refine the coarse result from unsupervised learning, ultimately enabling direct and rapid predictions of the target properties. Such a HiBoFL framework can not only reduce the expensive cost of labeling data for training supervised models, but also efficiently predict the complex properties.

We then apply this HiBoFL framework to efficiently identify semiconductors with ultralow κ_L as shown in Fig. 1b. In the first step, we obtain the material dataset from the MP database based on its application programming interface (API) and then apply a series of screening criteria to derive the first-level dataset for subsequent research. In the second step, chemical composition descriptors based on Magpie³⁸ and crystal structure descriptors derived from Voronoi tessellations³⁹ are used to featurize the materials in the first-level dataset. We use principal component analysis (PCA)⁴⁰, for dimensionality reduction thereby obtaining pivotal components. K-means clustering^41,42 is then employed to identify materials with similar κ_L, which is visualized in a low-dimensional space using t-distributed stochastic neighbor embedding (t-SNE)⁴³, and problem-specific clusters are selected based on similarity design rules to form the second-level dataset. In the third step, we use the phonon-elasticity-thermal (PET) model⁴⁴ to perform low-cost HTC on the materials in the second-level dataset, establishing a local database based on the HTC results. On the one hand, a list of candidates for potential TE applications can be recommended and their statistical analysis can help us to summarize several domain knowledge of ultralow κ_L. On the other hand, we can directly screen out candidate materials with ultralow κ_L for in-depth mechanism analysis, where the results are verified by accurately solving the phonon Boltzmann transport equation (BTE) and the phonon thermal transport mechanisms are further revealed. In the fourth step, we perform ensemble learning on the labeled local database for directly classifying ultralow κ_L, training a robust CatBoost classifier⁴⁵. The interpretable analysis of the pre-trained model based on the SHapley Additive exPlanations (SHAP) method^46,47,48 further reveals the influence of important descriptors on ultralow κ_L. A new factor capable of significantly reducing κ_L via enhancing structural anharmonicity is discovered, eventually building a bridge between the ML model interpretability and first-principles analysis.

Preliminary high-throughput screening

The process of preliminary high-throughput screening to select materials for subsequent ML investigations is illustrated in Fig. 2a. Initially, we start by acquiring all the materials from the MP database based on its API, resulting in a total of 1,54,718 entries saved in a JSON file as the Python dictionary object. Of these, several specific criteria are set to narrow down the range of exploration. The first screening criterion focuses on the assessment of thermodynamic stability, in which the materials with energy above the convex hull (E_hull, the formation energy difference between the target compound and its competing phases) no more than 0.05 eV/atom are considered to show a high likelihood of being synthesized in experiments. Further, the band gap (E_g) can be directly retrieved from the MP database, set within the range of 0.1–2 eV for assessing electrical conductivity. This specific interval is highly characteristic of semiconductors, demonstrating the inherent capacity of these screened materials for favorable electrical conductivity. Additionally, taking into account of the computational cost associated with κ_L, our constraints on the material system involve ensuring that the number of atoms (N_atoms) is less than 20 in the unit cell of a crystal structure and the number of elements (N_elements) is below four in one compound, respectively. Ultimately, through excluding the materials containing hydrogen, lanthanides, and actinides, and conducting structural analysis, we further obtain 2675 three-dimensional (3D) crystal structures without calculation errors. These materials constitute the first-level dataset for our in-depth investigations.

**Fig. 2: Dataset preparation and unsupervised learning optimization.**

Unsupervised learning for identifying materials with similar κ _L

We next carry out unsupervised learning to identify materials with a high likelihood of exhibiting relatively low κ_L based on similarity design rules. Following the generation of the first-level dataset as input, the materials should be transformed into the length-fixed vectors as the so-called descriptors, which are the key point to distinguish different materials. Herein, we use two distinct types of descriptors to featurize these materials, namely, composition-based features and structure-based features, to account for the mapping relationships between different categories of descriptors and κ_L from the perspectives of chemical composition and crystal structure. Among them, composition-based features are generated based on Magpie data and denoted by {E}, including electronegativity, atomic number, fraction of electrons, etc., which are comprehensive enough to capture the characteristics of chemical compositions with different constituent elements and proportions. Through partitioning the crystal structures into Wigner-Seitz cells of each atom, a series of structure-based features can be derived from Voronoi tessellations based on the characteristics of the local environment of each atom in the unit cell, which are denoted by {S}. In this manner, each material can generate a total of 273 descriptors automatically without any time-consuming DFT calculations based on its corresponding composition and structure, jointly denoted by {E, S}. Notably, these descriptors have been successfully applied to the ML predictions of κ_L at various temperatures⁴⁹, indicating their significant potential for mapping the property of κ_L. After feature generation, it is crucial to preprocess all the data thereby enhancing the performances of clustering model. To address the large variations among different feature values and transform them to follow a normal distribution, both standardization and quantile transformer are employed for preprocessing these descriptors. Further, we perform PCA based on singular value decomposition to project the preprocessed features to a lower dimensional space. PCA linearly combines original features into principal components (PCs), ensuring the hierarchical order based on their contributions, in which the first PC captures the largest explainable variance, followed by the second one and so forth. From the curve of total explainable variance changing with the number of PCs as shown in Figure S1, we can conclude that 83 PCs suffice to account for 99% of the variance among all 2675 materials in the first-level dataset. Hence, these 83 PCs are extracted as the pivotal components for input into the following clustering algorithm.

Subsequently, the k-means algorithm is utilized to identify the underlying associations with κ_L from these unlabeled materials, in which only one critical parameter is required to be pre-defined, i.e., the number of clusters k⁵⁰. Based on the analysis of the elbow method⁵¹ and the silhouette coefficient⁵² as depicted in Fig. 2b, the elbow-like point of inflection on the curve emerges at a value of k = 7, indicating the location where inertia or distortion starts decreasing significantly at a very slow rate, represents the optimal k value for the k-means clustering. At this stage, a relatively high silhouette coefficient also illustrates that each material exhibits strong similarity within its respective cluster, while different clusters are well-separated as much as possible. As a result, we partition the first-level dataset into seven clusters (from C₁, C₂, …, to C₇), on the basis of similarities of these feature vectors derived from chemical compositions and crystal structures, in which the materials in the same cluster are considered to show a high likelihood of possessing similar κ_L. Since the first two PCs only capture about 35% of the variance within entire data, it is insufficient for the clustering result to be intuitively visualized in a two-dimensional (2D) mapping. Unlike PCA, which emphasizes preserving large pairwise distances to maximize variance, t-SNE is a powerful non-linear dimensionality reduction technology for the visualization of high-dimensional data, aiming to maintain pairwise similarities among data points in a lower-dimensional space. We strictly use t-SNE only for the visualization of the resulting seven clusters of k-means algorithm as shown in Fig. 3, projecting the 83 PCs into a 2D latent space comprised of two t-SNE components. The result intuitively demonstrates the high-quality clustering with distinct separation between each cluster, where each point represents a material and its color relates to the corresponding category of cluster.

**Fig. 3: Unsupervised learning result in the first-level dataset.**

Since we have obtained seven clusters through k-means clustering, it is necessary to conduct the similarity analysis of these clusters. Materials within the same cluster are considered to show similar structures thereby likely sharing similar properties, which facilitates a deep comprehension of underlying patterns and relationships among the 2675 materials. To evaluate the similarity criteria of these materials, the reported κ_L values at 300 K for several known materials included in each cluster are collected from the previous studies (most are experimentally measured), which are all listed in Table S1 along with some other basic information. As we expected, materials within the same cluster exhibit closely similar κ_L values, while there is a comparatively significant difference in κ_L among materials from different clusters. Particularly, the known materials with relatively low κ_L are clustered into C₁ and C₂ through k-means clustering, including eight structures of Tl₃AsSe₃ (0.23 W/mK)⁵³, Tl₂Te₃ (0.40 W/mK)⁵⁴, Tl₃SbS₃ (0.42 W/mK)⁵⁵, TlBiS₂ (0.80 W/mK)⁵⁶, CuBr (1.30 W/mK)⁵⁷, Cu₂GeS₃ (1.20 W/mK)⁵⁴, RbSbS₂ (1.60 W/mK)⁵⁵ and AgGaS₂ (1.50 W/mK)⁵⁸, which are all clearly shown in the enlarged part of Fig. 3. On the contrary, C₇ contains materials with apparently large κ_L, including four structures of GaN (130 W/mK)⁵⁹, BP (350 W/mK)^60,61, SiC (490 W/mK)⁶², and Si (156 W/mK)⁶³. Thus, as confirmed by the good distinction between low and high κ_L among each cluster, our proposed unsupervised learning model demonstrates great potential to identify compositional and structural information about the κ_L of these materials, leading to the successful clustering into different categories according to this property. Since the materials with a high likelihood of possessing relatively low κ_L tend to group into the two clusters of C₁ and C₂, the exploration scope for finding materials with low κ_L is reduced from 2675 materials to 704 materials, narrowing by approximately three-quarters. As a result, these two problem-specific clusters (C₁ and C₂) further constitute our second-level dataset.

HTC of problem-specific clusters for statistical analysis and material discovery

Given the success of unsupervised learning in significantly reducing the broad material search space, it has become feasible to further extend the second-level dataset into a labeled repository of thermal conductivity through HTC at affordable computational costs. Based on the PET empirical equation within the high-throughput framework, we derive the κ_PET values at 300 K ignoring the anisotropy for materials in the second-level dataset, while excluding structures that do not satisfy the mechanical stability criteria. The basic information and corresponding κ_PET values of the resulting 661 materials are then stored in a local database using MongoDB⁶⁴, serving as a valuable repository to retrieve data and prioritize detailed theoretical and experimental investigations. Of particular note is that nearly 70% of these semiconductors exhibit the κ_PET values no greater than 2 W/mK (left panel in Fig. 4a), with a considerable portion of these materials remaining unreported to date. These materials also constitute a list of candidate materials with potential applications in the TE field (Table S2), which undoubtedly prove that the problem-specific clusters we identified through unsupervised learning indeed contain a significant number of materials with low thermal conductivity.

**Fig. 4: HTC-based data annotation for statistical analysis and material discovery in the second-level dataset.**

In addition, we also count the distribution of different formula types with counts exceeding 20 in these materials with κ_PET ≤ 2 W/mK (right panel in Fig. 4a). Apparently, the structures represented by the two types of formula anonymous dominate in quantity, namely the ABC₂ type characterized by the diamond-like structure, and the ABC₃ type represented by the perovskite-type structure. To gain a more intuitive insight into the distribution patterns of materials with low thermal conductivity, we plot a heat map over the periodic table of elements in Fig. 4b, illustrating the count of elements present in these compounds with κ_PET no greater than 2 W/mK. Sulfur, selenium, tellurium, and oxygen belonging to the chalcogens consecutively occupy the largest counts among the anion elements, in which sulfur is the most abundant with a count of 102. As for the cation elements, silver, cesium, potassium, copper, and rubidium occupy the highest abundance, respectively. This can be explained by the fact that materials composed of heavy elements or characterized by weak chemical bonding typically exhibit lower κ_L. Notably, a few previous studies have constrained the search space of materials within the above-mentioned characteristics to investigate those with low thermal conductivity, such as high-throughput screening in chalcogenide ABC₃ perovskites⁶⁵ or diamond-like ABC₂ compounds⁶⁶, detailed analysis in the IV-VI chalcogenides⁶⁷ and so forth^37,68,69,70. The box plot of different crystal systems indicates that the κ_PET values of materials in each crystal system primarily cluster around 0.5 W/mK, with the monoclinic system exhibiting the largest quantity (Fig. 4c). We also show the density distributions of the shear modulus and bulk modulus obtained by the Voigt–Reuss–Hill (VRH) method as shown in Fig. 4d, which are primarily concentrated within the relatively small range of ~50 GPa. This suggests that materials with lower shear and bulk moduli might be more likely to exhibit lower thermal conductivity. These statistical analyses provide deep insights into the regulation of κ_L, which can offer prior domain knowledge for researchers in selecting specific systems to obtain ultralow κ_L.

To further validate the results and conduct in-depth analysis of the phonon thermal transport mechanisms, we calculate more precise κ_L values according to first-principles derived force constants and Boltzmann transport theory for the unreported materials Cs₂SnSe₃ and Cs₂GeSe₃, which are screened out based on the formula type ranking among the materials with the lowest κ_PET values. Figure 4e depicts the calculated κ_L as the function of temperature ranging from 100 to 500 K at different axes in the discussed two semiconductors, in which all the intrinsic κ_L values show obvious anisotropy and gradually decrease following the T^-1 manner with the temperature increasing just as the hallmark of Umklapp scattering. The rise in temperature results in an elevation in the equilibrium phonon population, consequently leading to intense phonon-phonon collisions, as delineated by $\overline{n}\approx {k}_{{\rm{B}}}T/\hslash \omega (T\gg {\Theta }_{{\rm{D}}})$, where $\overline{n}$ is the average number of phonons, k_B is the Boltzmann constant, T is the temperature, ω is the phonon frequency, ℏ is the reduced Planck constant, and Θ_D is the Debye temperature. Moreover, these materials all exhibit intrinsically ultralow κ_L, with values all below 0.25 W/mK in any direction at 300 K, significantly lower than the κ_L (~2.3 W/mK) of the traditional TE material PbTe⁷¹. The results indicate the potential application of Cs₂SnSe₃ and Cs₂GeSe₃ in the TE field, further substantiating the effectiveness of our previously adopted approach–combining unsupervised learning with HTC to discover semiconductors with ultralow κ_L.

Mechanisms of phonon thermal transport properties

Generally speaking, compounds with heavier atoms tend to exhibit lower group velocities due to the reduced phonon frequency. As a result, a lower κ_L of Cs₂SnSe₃ was expected given the relatively heavy nature of Sn. However, it is noteworthy that the κ_L of Cs₂SnSe₃ is obviously higher than that of Cs₂GeSe₃ (i.e., ~3 times in the a-axis, ~1.7 times in the b-axis, ~2 times in the c-axis at 300 K), which presents an interesting unusual phenomenon. Next, we would like to discuss the microscopic mechanisms responsible for the ultralow κ_L and unusual difference observed among the two materials.

Cs₂SnSe₃ and Cs₂GeSe₃ both crystallize in the same space group C2/m (No. 12) of the monoclinic crystal system, with the fully optimized crystallographic parameters listed in Table S3. Their crystal structures are shown in Fig. 5a, b. Sn/Ge atoms are coordinated by four Se atoms in a tetrahedral geometry [SnSe₄]/[GeSe₄] at distances in 2.504–2.643 Å/2.336–2.477 Å, and these tetrahedra further share a common edge to form dimeric [Sn₂Se₆]/[Ge₂Se₆] building units, respectively, which form the four-membered [Sn₂Se₂]/[Ge₂Se₂] rings and are charge compensated by Cs cations. The phase diagrams for Cs–Sn–Se and Cs–Ge–Se systems based on the calculated energies in the MP database show that Cs₂SnSe₃ and Cs₂GeSe₃ possess similar thermodynamical stability with equal (zero) convex hull distances (Fig. S2). Simultaneously, the phonon dispersion curves along high-symmetry directions in Brillouin zone indicate that they are dynamically stable due to the absence of imaginary phonon modes (left panels in Fig. 5c, d). Thus, the results demonstrate the feasibility of experimentally synthesizing these two materials. To identify the bonding characteristics in Cs₂SnSe₃ and Cs₂GeSe₃, we employ the electron localization function (ELF) to quantify the extent of spatial localization of the reference electron with values ranging from 0 to 1 (Fig. 5a, b). The localization of electrons in the Sn/Ge–Se bonding region illustrates the covalent nature of the Sn/Ge–Se bonds, in which a polar covalent bond between Sn and Se (ELF ≈0.5) is observed due to the smaller electronegativity and larger atomic radius of Sn, leading to a significant difference among the two compounds. On the contrary, there is no overlapping of charge clouds between Cs atoms and other atoms, indicating the presence of strong ionic bonding which aligns with the fact that a relatively large electronegativity (on the pauling scale) difference (1.76) between Cs (0.79) and Se (2.55) results in strong ionic characteristics. Therefore, Cs₂SnSe₃ and Cs₂GeSe₃ both contain multiple types of bonds, i.e., ionic Cs–Se and covalent Sn/Ge–Se, resulting in complex crystal structures with bonding hierarchy. These complex structures, comprising heavy atoms, weakly bound and rigid distorted units with a significant bonding hierarchy, are anticipated to exhibit large lattice anharmonicity⁷².

**Fig. 5: Mechanisms of phonon thermal transport properties based on first-principles calculations.**

The phonon dispersion reveals that there are 3 acoustic branches and 33 optical branches at each phonon wave vector q due to the 12 atoms in the primitive cell of both Cs₂SnSe₃ and Cs₂GeSe₃, in which three acoustic branches are composed of one longitudinal acoustic mode (LA) and two transverse acoustic modes (TA and TA’). A striking common feature of their phonon dispersion is that a waterfall-like low-lying optical branch (LLO) exhibits the avoided crossing behavior with acoustic modes around the Brillouin zone center, resulting in strong acoustic-optical coupling as one of the potential signals of the rattling model⁷³. These characteristics can not only lead to a softening of the acoustic modes thereby yielding low phonon group velocities, but also greatly enhance the scattering rates of heat-carrying acoustic phonons, all of which contribute to the suppression of κ_L for Cs₂SnSe₃ and Cs₂GeSe₃. Analysis of the atom-projected PDOS and spectral κ_L(ω) (middle and right panels in Fig. 5c, d) indicate that acoustic and LLO phonon modes in the low-frequency range (0–2 THz) are primarily dictated by Cs and Se vibrations followed by Sn/Ge vibrations, which also make predominant contributions to the ultralow κ_L. A localized region mostly contributed by Cs atoms within a narrow energy window centered around 1 THz is observed in both compounds, indicating the anharmonic rattling-like motion of the weakly bonded Cs atoms, which is responsible for the presence of the soft LLO modes. This can also be confirmed in the potential energy curves obtained by shifting the atoms with respect to their static equilibrium positions along different axes as depicted in Fig. S3. Sn/Ge and Se atoms are both confined within the comparatively steep potential wells, whereas Cs atoms with heavier mass can vibrate easily with larger amplitude due to the shallowest potential energy surface in all the directions. These atoms exhibit nearly identical displacement magnitudes in their respective compounds. Therefore, the loosely bound Cs atoms are surrounded by [Sn₂Se₆]/[Ge₂Se₆] units thereby exhibiting the same rattling-like effect, which can result in a common strong anharmonicity in both compounds^74,75.

Figures 5e and S4 show the frequency dependence of the group velocity υ from all the q points in different axes at 300 K for Cs₂SnSe₃ and Cs₂GeSe₃, as given by υ_λ = ∂ω_λ/∂q. Most of the phonon modes possess ultralow υ values less than 2 km/s, confirming the lattice softening induced by the weak interatomic bonding, which is also consistent with the flat phonon bands observed in the phonon dispersion. Figure 5f depicts the overall short τ values mainly ranging from 0.2 to 20 ps, with low-frequency phonons exhibiting a gradual shortening trend in both compounds. This can be attributed to the presence of LLO modes, which facilitate more scattering paths and impede the heat flow, significantly enhancing the scattering rates thereby reducing τ for both acoustic and LLO modes, ultimately reducing κ_L. It is noteworthy that in the low-frequency phonon region, particularly within the acoustic phonon range, the phonon group velocities of these two compounds show little difference, yet the phonon lifetime of Cs₂SnSe₃ is significantly longer than that of Cs₂GeSe₃. Therefore, we conclude that the relatively lower anomalous κ_L of Cs₂GeSe₃ is attributed to its shorter phonon lifetime.

We speculate that the difference in the phonon lifetime between Cs₂SnSe₃ and Cs₂GeSe₃ might be related to the strength of Sn–Se and Ge–Se covalent bonding. This covalent bonding may induce anisotropic motion, involving the collective movement of Sn/Ge and Se atoms, thereby leading to a strong lattice anharmonicity within the system⁷⁶. To investigate this effect on the phonon lifetime, we perform crystal orbital Hamilton population (COHP) calculations⁷⁷ to analyze the energy-resolved local bonding information of Cs₂SnSe₃ and Cs₂GeSe₃. Figure 5g shows that anti-bonding states persist down to −3 eV below the Fermi level (E_F) in Sn–Se bonding, which can weaken the corresponding bonding strength thereby forming a polar covalent bond as observed in ELF. This can be demonstrated by the integral of COHP (${\rm{ICOHP}}=\mathop{\int}\nolimits_{-\infty }^{E}{\rm{COHP}}(E){\rm{d}}E$) at the E_F, which represents all occupied orbitals, serving as an indicator of bond strength. Since the average ICOHP values for Sn–Se and Ge–Se bonds are −0.72 eV and −4.49 eV, respectively, indicating the obviously weaker Sn–Se bonding in Cs₂SnSe₃. The stronger bonding strength of Ge–Se is also evident from its larger force constant $\left\vert {\Phi }_{ij}\right\vert$ in comparison to that of Sn–Se (Fig. S5). Hence, the shorter phonon lifetime of Cs₂GeSe₃ relative to Cs₂SnSe₃ may be attributed to their significant difference in Ge–Se and Sn–Se covalent bonding, which can be further quantified at the structural descriptor level through subsequent interpretable ML approaches.

Interpretable supervised learning for predicting ultralow κ _L

After labeling the materials in the second-level dataset based on our HTC framework to obtain the local database, we develop interpretable supervised classification models to predict ultralow κ_L for efficiently by-passing the complex ab initio calculations, which can further refine the coarse result of unsupervised learning and provide greater robustness. Here, materials with κ_PET not exceeding 2 W/mK are labeled as 1, which are considered to possess ultralow κ_L; otherwise, they are labeled as 0, signifying non-ultralow κ_L.

Beyond the input dataset itself, identifying the most relevant features and appropriate algorithms can significantly enhance the generalization and accuracy of ML models, which are also quite strongly intertwined. The chemical compositions and crystal structures of these compounds are also featurized into 273 descriptors based on the aforementioned Magpie data and Voronoi tessellations. As for the ML models, we compare eight widely-used classification algorithms for predicting material properties according to the indicators of area under the receiver operating characteristic curve (ROC AUC) and accuracy based on stratified tenfold cross-validation (Fig. S6): decision tree (DT)⁷⁸, extra trees (ET)⁷⁹, random forest (RF)⁸⁰, gradient boosting classifier (GBC)⁸¹, adaptive boosting (AdaBoost)⁸², eXtreme Gradient Boosting (XGBoost)⁸³, Light Gradient Boosting Machine (LightGBM)⁸⁴, Categorical Boosting (CatBoost)⁴⁵. Most algorithms demonstrate good performances, with the CatBoost algorithm achieving the best results, which is chosen to build the follow-up classification model. To avoid the curse of dimensionality, we perform feature selection using the model-based wrapper method and Pearson correlation method. Descriptors are iteratively removed based on the feature importance scores obtained from the CatBoost algorithm, and an ML model is subsequently trained using the remaining descriptors at each step. This process yields the curve depicting the model performance as a function of the number of features (Fig. S7), indicating that optimal performance is achieved when utilizing the top 11 features. To reduce the correlation between features, the Pearson correlation coefficients (p) of these 11 features are calculated, which are defined as: $p={\rm{cov}}({x}_{i},{x}_{j})/{\sigma }_{{x}_{i}}{\sigma }_{{x}_{j}}$, where cov(x_i, x_j) is the covariance of features x_i and ${x}_{j},{\sigma }_{{x}_{i/j}}$ is the standard deviation of the feature x_i/j. For pairs of features with $\left\vert p\right\vert \,>\,$ 0.8, we retain the one with the higher feature importance score. Figure 6a displays the heatmap of the Pearson correlation coefficient matrix among the selected nine features and illustrates the specific meanings of these features based on elemental attributes and Wigner–Seitz cells, indicating that we have successfully eliminated redundant descriptors. We adjust the hyperparameters of the ML model based on Bayesian optimization over 200 random trials, obtaining the best hyperparameters as listed in Table S4. As a result, the performances of the optimal CatBoost classification model are characterized by the ROC curve and the confusion matrix (Fig. 6b), showing its great ability in classifying semiconductors with ultralow κ_L due to the high ROC AUC (0.94), accuracy (0.90), precision (0.89), recall (0.90) and F1-score (0.89).

**Fig. 6: Supervised learning in the local database and interpretable analysis.**

To establish the bridge between material descriptors and thermal transport mechanisms thereby demystifying the black-box nature of ML, we perform interpretable analysis for the pre-trained κ_L classification model based on the SHAP method^46,47. Figure 6c shows the dendrogram of the optimal nine descriptors via using the hierarchical clustering based on their SHAP values, where a partition line classifies these features into three groups. Interestingly, beyond the minimum relative bond length ${L}^{\min }$, two classes of descriptors based on chemical compositions and crystal structures are identified as we expected. This indicates that ${L}^{\min }$ may possess a unique influence on ultralow κ_L.

The feature importance and the impact of different features on ultralow κ_L are shown in Fig. 6d, which make a global interpretation for the classification model. We here take the most important feature in the three classes as an example to detailedly explain the corresponding hierarchical influence on κ_L combining the SHAP values. For the most important feature ${W}_{{\rm{A}}}^{\min }$, defined as:

$${W}_{{\rm{A}}}^{\min }=\min \left\{{W}_{{\rm{A}}}^{{\rm{1}}},\ldots ,{W}_{{\rm{A}}}^{i},\ldots ,{W}_{{\rm{A}}}^{n}\right\}$$

(1)

Where ${W}_{{\rm{A}}}^{i}$ is the atomic weight of element i and n is the number of elements, respectively. Semiconductors with larger ${W}_{{\rm{A}}}^{\min }$ values tend to exhibit low κ_L since the corresponding positive SHAP values, whereas those with smaller ${W}_{{\rm{A}}}^{\min }$ values show the opposite trend. This can be explained by the dispersion relation for the frequency, just as $\omega =2\sqrt{\beta /M}\left\vert \sin qa/2\right\vert$ in a one-dimensional crystal lattice, where β, M, and a are the bond force constant, atomic mass, and the distance between two neighboring atoms, respectively. Large M (W) can reduce the frequency ω and thus the group velocity υ, resulting in a low κ_L. For the feature α, defined as:

$$\alpha =\frac{N{V}_{\max }}{V}\times 100$$

(2)

Where N is the number of atoms, ${V}_{\max }$ is the largest sphere volume occupied by one atom that can fit inside its Voronoi cell, V is the cell volume. It is obvious that low α tends to have a positive effect on ultralow κ_L since its positive SHAP values, which can be explained by its direct effect on the bond length. α helps in understanding how closely the atoms are bonded, with lower values often signifying larger void space with longer bond lengths in a crystal structure. Longer bond lengths typically result in softer lattices characterized by lower group velocity υ, thereby reducing κ_L. As for the special feature ${L}^{\min }$, defined as:

$${L}^{\min }=\frac{\min \{{L}_{i}\}}{{\rm{mean}}\{{L}_{i}\}}$$

(3)

Where L_i is the weighted average bond length of atom i in a crystal structure, defined as:

$${L}_{i}=\frac{{\sum }_{n}{A}_{n}\cdot {\left\Vert \vec{{r}_{n}}-\vec{{r}_{i}}\right\Vert }_{2}}{{\sum }_{n}{A}_{n}}$$

(4)

Where $\vec{{r}_{i}}$ is the position of atom i, $\vec{{r}_{n}}$ and A_n is the position and the area of the nth neighbor of atom i, respectively. L_i provides a comprehensive description of the interactions between each atom and its nearest neighboring atoms in the crystal structure, with the result influenced by two components: the distance term ${\left\Vert \vec{{r}_{n}}-\vec{{r}_{i}}\right\Vert }_{2}$ and the weighted term A_n. The distance term ${\left\Vert \vec{{r}_{n}}-\vec{{r}_{i}}\right\Vert }_{2}$ directly governs the bond length between the nearest neighboring atoms, with shorter distances implying stronger bonding, thus leading to a lower L_i. The weighted term A_n assigns varying weights to different nearest neighboring atoms, where the bond lengths corresponding to the nearest neighboring atoms with greater weights show a higher effect on L_i. Therefore, $\min \{{L}_{i}\}$ captures the local information within the structure, which may correspond to covalent bonds due to their typical shorter bond lengths. In contrast, mean{L_i} captures the global information of a structure, encompassing both short covalent bonds, long ionic bonds, etc. We propose that the ratio of these two terms, i.e., ${L}^{\min }$, might potentially reflect the anharmonicity of materials. Lower ${L}^{\min }$ values exhibit positive SHAP values, which favor the emergence of ultralow κ_L. This implies that when a structure features lower $\min \{{L}_{i}\}$ and higher mean{L_i}, it may possess rigidly distorted and weakly bound units with a significant bonding hierarchy, typically leading to stronger anharmonicity, as previously analyzed.

The overall contributions of all descriptors of each material to the model output based on their SHAP values are visualized by the t-SNE method (Fig. 6e), showing that ultralow κ_L (dark blue) and non-ultralow κ_L (light blue) are clearly clustered into the corresponding groups. Cs₂GeSe₃ and Cs₂SnSe₃ are then marked out together to reveal the local interpretability using the SHAP force plots (Fig. 6f), thereby exploring their previously discussed unusual difference of κ_L. Among the features that have the most significant impact on these two materials, ${L}^{\min }$ surprisingly reaches a level of importance in Cs₂GeSe₃ comparable to that of ${W}_{{\rm{A}}}^{\min }$. Although ${W}_{{\rm{A}}}^{\min }$ in Cs₂GeSe₃ is lower than in Cs₂SnSe₃, the crucial role of ${L}^{\min }$ in anharmonicity drives the predicted κ_L of Cs₂GeSe₃ to be much lower. Given the significant difference in phonon lifetime τ between these two materials, as indicated by previous phonon thermal transport analysis, we speculate that ${L}^{\min }$ might have a great influence on τ, which can be supported by two perspectives. On the one hand, the trend between ${L}^{\min }$ and τ in Cs₂GeSe₃ and Cs₂SnSe₃ indicates that the former not only exhibits a lower ${L}^{\min }$ but also has a significantly shorter τ (Fig. 6g). This aligns with the impact of ${L}^{\min }$ on anharmonicity revealed by the SHAP analysis. On the other hand, Table S5 indicates that the $\min \{{L}_{i}\}$ values in Cs₂GeSe₃ and Cs₂SnSe₃ correspond to the L values derived from the Voronoi polyhedra centered on Ge and Sn, respectively. Although Cs₂GeSe₃ has a lower mean{L_i} compared to Cs₂SnSe₃, its much lower $\min \{{L}_{i}\}$ (i.e., L_Ge) eventually results in a much lower ${L}^{\min }$. This is consistent with the results based on first-principles analysis, which attributes the difference in phonon lifetime to the stronger bonding strength (shorter bond length) of Ge–Se covalent bonding compared to that of Sn–Se covalent bonding.

Discussion

We propose a novel HiBoFL framework via integrating unsupervised learning and supervised learning to efficiently predict costly-labeled complex properties and uncover structure-property relationships. As a compelling demonstration, this framework has been successfully applied to identify semiconductors with ultralow κ_L, which circumvents large-scale brute-force ab initio calculations without clear objectives. By employing unsupervised learning for the materials from the MP database, we successfully group them into seven clusters. A few hundred materials in the problem-specific clusters C₁ and C₂ with a high likelihood of possessing low κ_L are selected from a pool of hundreds of thousands based on similarity design rules. We further conduct low-cost HTC on materials belonging to the two clusters, establishing a local database for researchers to retrieve and providing a list of candidate materials with potential applications in the TE field. Additionally, statistical analysis of these candidates offers valuable domain knowledge to guide the design of materials with ultralow κ_L. Cs₂GeSe₃ and Cs₂SnSe₃ with ultralow κ_L (~0.25 W/mK) are screened out, in which the anomalous ultralow κ_L of Cs₂GeSe₃ is attributed to the lower τ caused by the difference in covalent bonding. Based on the established local database, we train a robust supervised classification model to refine unsupervised learning, achieving a high ROC AUC of 0.94 on the test set, which can enable the efficient predictions of ultralow κ_L materials. The interpretable analysis of the classification model reveals the mechanisms of key descriptors on κ_L modulation, such as ${W}_{{\rm{A}}}^{\min },\alpha$ and ${L}^{\min }$, etc. The special factor ${L}^{\min }$ is discovered to show a unique influence on structural anharmonicity, leading to the difference of phonon lifetime in Cs₂GeSe₃ and Cs₂SnSe₃, in agreement with the first-principles analysis. We believe that this work provides a novel feasible way for efficiently seeking promising TE materials, in which the proposed HiBoFL framework is also expected to be applied in other material fields.

Methods

First-principles calculations

All the involved DFT-based first-principles calculations were carried out by using the projector-augmented wave (PAW) method⁸⁵ to deal with ion-electron interactions as implemented in the Vienna Ab initio Simulation Package (VASP)⁸⁶, in which the processing of data was conducted using VASPKIT⁸⁷. The electronic exchange-correlation energy was described by the Perdew–Burke–Ernzerhof (PBE) functional under the generalized gradient approximation (GGA)⁸⁸. Our automatic HTC workflow conducted on the candidate structures identified by unsupervised learning was accomplished within the framework of Python Materials Genomics (Pymatgen)⁸⁹. With a plane-wave kinetic energy cutoff of 520 eV and a Γ-centered k-point grid of 2π × 0.04 Å⁻¹ to sample the Brillouin zone, the structure optimization was terminated when the total energy convergence reached below 10⁻⁶ eV and the norms of all the forces were less than 0.01 eV/Å. The elastic properties of each material were calculated by applying a 1% change to the volume of the optimized conventional cell. To evaluate the dynamic stability and extract the second-order interaction-force constants (2nd-order IFCs) of Cs₂SnSe₃ and Cs₂GeSe₃, we used the finite displacement method⁹⁰ for calculations as implemented in the Phonopy package^91,92. The obtained 2nd-order IFCs were utilized to construct the dynamic matrix and compute the corresponding harmonic properties. Additionally, we utilized the script thirdorder.py⁹³ to generate the 2 × 2 × 1 and 2 × 2 × 1 supercells in consideration of the 10th nearest neighbors for Cs₂SnSe₃ and Cs₂GeSe₃, thereby resulting in 1504 and 1500 supercells with displaced atoms for self-consistent calculations, respectively. The third-order interaction force constants (3rd-order IFCs) were further extracted to obtain the three-phonon scattering matrix elements, facilitating the calculations of anharmonic properties. Ultimately, we obtained the convergent κ_L values of these selected materials as implemented in the ShengBTE package⁹⁴ within a 20 × 20 × 20 q-point grid in reciprocal space. The crystal structures and ELF diagrams were visualized using the Crystal Toolkit⁹⁵ and VESTA⁹⁶. Moreover, we calculated the COHP as implemented in the LOBSTER code⁷⁷ to identify the bonding characteristics as bonding, anti-bonding, or non-bonding.

Theoretical framework

Taking into account the extremely high cost of precisely calculating κ_L, we employed the PET model proposed in our previous work⁴⁴ within the HTC framework to obtain the corresponding values at 300 K for materials identified by unsupervised learning. The PET model has established the relationship between intrinsic κ_L and elastic properties, considering both acoustic phonon and optical phonon contributions, which achieves a certain balance between accuracy and efficiency. The empirical equation of κ_PET based on the PET model is expressed as:

$$\,{\kappa }_{{\rm{PET}}}=\frac{{(6{\pi }^{2})}^{2/3}}{3{\pi }^{3}}\cdot \frac{\overline{M}}{T{(\overline{V}N)}^{2/3}}\frac{{\overline{\upsilon }}^{3}}{{\overline{\gamma }}^{2}}+\frac{3{k}_{{\rm{B}}}\overline{\upsilon }}{2{\overline{V}}^{2/3}}{\left(\frac{\pi }{6}\right)}^{1/3}(1-{N}^{-2/3})$$

(5)

Where $\overline{M}$ is the average atomic mass, T is the temperature, $\overline{V}$ is the average atomic volume, N is the number of atoms in the primitive cell, k_B is the Boltzmann constant, $\overline{\upsilon }$ and $\overline{\gamma }$ are the average sound velocity and average Grüneisen parameter, respectively. Among them, we can obtain $\overline{\upsilon }$ as given by:

$$\begin{array}{ll}\overline{\upsilon }\;=\;{\left[\frac{1}{3}\left({\overline{\upsilon }}_{{\rm{L}}}^{-3}+2{\overline{\upsilon }}_{{\rm{T}}}^{-3}\right)\right]}^{-1/3}\\\quad\,=\;{\left\{\frac{1}{3}{\left(\frac{\overline{M}}{\overline{V}}\right)}^{3/2}\left[{\left(B+\frac{4}{3}G\right)}^{-3/2}+2{G}^{-3/2}\right]\right\}}^{-1/3}\end{array}$$

(6)

And $\overline{\gamma }$ can be expressed as:

$$\begin{array}{lll}\overline{\gamma }\;=\;\sqrt{\frac{1}{3}({\overline{\gamma }}_{{\rm{L}}}^{2}+2{\overline{\gamma }}_{{\rm{T}}}^{2})}\\\quad\, =\sqrt{\frac{1}{3}\left\{{\left[\frac{1}{2}\frac{\partial {{\ln }}\,\left(B\,+\,\frac{4}{3}G\right)}{\partial {{\ln }}V}+\frac{1}{6}\right]}^{2}+2{\left(\frac{1}{2}\frac{\partial {{\ln }}\,G}{\partial {{\ln }}\,V}+\frac{1}{6}\right)}^{2}\right\}}\end{array}$$

(7)

Where ${\overline{\upsilon }}_{{\rm{L}}},{\overline{\upsilon }}_{{\rm{T}}},{\overline{\gamma }}_{{\rm{L}}}$ and ${\overline{\gamma }}_{{\rm{T}}}$ are the longitudinal sound velocity, transverse sound velocity, longitudinal Grüneisen parameter, and transverse Grüneisen parameter, respectively. B and G are bulk moduli and shear moduli, respectively. In this manner, we can calculate the κ_PET for different materials derived from their elastic properties within the HTC framework.

To further validate κ_PET from the PET empirical equation and investigate the phonon thermal transport mechanisms, we calculated more accurate κ_L values for Cs₂SnSe₃ and Cs₂SnSe₃ by iteratively solving the phonon BTE⁹⁴^,97:

$${\kappa }_{{\rm{L}}}^{\alpha \beta }=\frac{1}{{k}_{{\rm{B}}}{T}^{2}\Omega N}\mathop{\sum }\limits_{\lambda }{f}_{0}\left({f}_{0}+1\right){\left(\hslash {\omega }_{\lambda }\right)}^{2}{\upsilon }_{\lambda }^{\alpha }{F}_{\lambda }^{\beta }$$

(8)

Where α and β are the Cartesian indexes. k_B, T, Ω, N are the Boltzmann constant, temperature, volume of the unit cell, and regular grid of q points, respectively. λ is a phonon mode including the branch index p and wave vector q, and f₀ is the phonon distribution function based on Bose–Einstein statistics. $\hslash ,{\omega }_{\lambda },{\upsilon }_{\lambda }^{\alpha }$ are the reduced Planck constant, phonon frequency, and phonon group velocity along the α direction, respectively. When only considering two- and three-phonon processes that contribute to scattering, the linearized BTE F_λ takes the form:

$${F}_{\lambda }={\tau }_{\lambda }^{0}({\upsilon }_{\lambda }+{\Delta }_{\lambda })$$

(9)

Where ${\tau }_{\lambda }^{0}$ and Δ_λ are the phonon lifetime of mode λ and corrective term from iteration, respectively.

Machine learning

The chemical compositions and crystal structures of materials were featurized into different descriptors using the Matminer package⁹⁸. All parts related to ML were carried out using the Scikit-learn package⁹⁹. To achieve the automation and acceleration of hyperparameter optimization in supervised learning, we employed the powerful Optuna package¹⁰⁰ for efficiently finding the best hyperparameters. And to provide the interpretable analysis of the black-box ML model, a game-theoretic approach as implemented in the SHAP package^46,47,48 was employed.

For unsupervised learning, the elbow method and the silhouette coefficient method are utilized to determine the optimal k value in k-means clustering. In the elbow method, the sum of squared errors (SSE) is defined as:

$$SSE=\mathop{\sum }\limits_{i=1}^{k}\sum _{x\in {C}_{i}}\parallel\!\! x-{\mu }_{i}{\parallel }^{2}$$

(10)

Where C_i represents the set of data points in cluster i, x is a data point in C_i, μ_i is the centroid of C_i. The silhouette coefficient is defined as:

$$s(x)=\frac{b(x)-a(x)}{\max (a(x),b(x))}$$

(11)

Where b(x) is the average intra-cluster distance and measures how close the data point x is to other points in C_i, a(x) is the average nearest-cluster distance and measures how far data point x is from the closest other cluster. After calculating the silhouette coefficient for all data points, the average value is then taken to obtain the mean silhouette coefficient for different values of k.

For supervised learning, accuracy, precision, recall, F1-score, and ROC AUC are utilized to evaluate the performance metrics of classification models, which are defined as:

$$Accuracy=\frac{TP+TN}{TP+TN+FP+FN}$$

(12)

$$Precision=\frac{TP}{TP+FP}$$

(13)

$$Recall=\frac{TP}{TP+FN}$$

(14)

$$F1\,\text{-}\,socre=\frac{2TP}{2TP+FP+FN}$$

(15)

$$ROC\,AUC=\mathop{\int}\nolimits_{0}^{1}TPR\,d(FPR)$$

(16)

Where TP is the number of true positives, FP is the number of false positives, TN is the number of true negatives, and FN is the number of false negatives. Moreover, TPR and FPR are the true positive rate and false positive rate, respectively, which are defined as:

$$TPR=\frac{TP}{TP+FN}$$

(17)

$$FPR=\frac{FP}{TN+FP}$$

(18)

For interpretable analysis, the SHAP value of a material descriptor is given by:

$${\varphi }_{i}(p)=\sum _{S\subseteq F\setminus \{i\}}\frac{| S| !(| F| -| S| -1)!}{| F| !}\left(p(S\cup \{i\})-p(S)\right)$$

(19)

Where p is the prediction model, F represents the set of all material descriptors, S represents the subset of all descriptors excluding the feature i, p(S) is the model prediction when only descriptors in S are considered, and p(S ∪ {i}) is the model prediction when feature i is also included.

Data availability

The dataset used in this study is publicly available from the Materials Project database: https://next-gen.materialsproject.org/.

Code availability

All the code used in this study is available under the MIT license at this GitHub repository: https://github.com/mf-wu/HiBoFL.

References

Choudhary, K. et al. Recent advances and applications of deep learning methods in materials science. Npj Comput. Mater. 8, 59 (2022).
Article Google Scholar
Luo, Y., Li, M., Yuan, H., Liu, H. & Fang, Y. Predicting lattice thermal conductivity via machine learning: a mini review. Npj Comput. Mater. 9, 4 (2023).
Article Google Scholar
Tang, Z. et al. A deep equivariant neural network approach for efficient hybrid density functional calculations. Nat. Commun. 15, 8815 (2024).
Article CAS PubMed PubMed Central Google Scholar
Chen, Z., Bononi, F. C., Sievers, C. A., Kong, W.-Y. & Donadio, D. UV–visible absorption spectra of solvated molecules by quantum chemical machine learning. J. Chem. Theory Comput. 18, 4891–4902 (2022).
Article CAS PubMed Google Scholar
Deringer, V. L., Caro, M. A. & Csányi, G. Machine learning interatomic potentials as emerging tools for materials science. Adv. Mater. 31, 1902765 (2019).
Article CAS Google Scholar
Yao, Z. et al. Inverse design of nanoporous crystalline reticular materials with deep generative models. Nat. Mach. Intell. 3, 76–86 (2021).
Article Google Scholar
Wu, M. et al. Target-driven design of deep-UV nonlinear optical materials via interpretable machine learning. Adv. Mater. 35, 2300848 (2023).
Article CAS Google Scholar
Wang, X. et al. An interpretable formula for lattice thermal conductivity of crystals. Mater. Today Phys. 48, 101549 (2024).
Article CAS Google Scholar
Fan, T. & Oganov, A. R. Combining machine-learning models with first-principles high-throughput calculations to accelerate the search for promising thermoelectric materials. J. Mater. Chem. C 13, 1439–1448 (2025).
Article CAS Google Scholar
Jain, A. et al. Commentary: the Materials Project: a materials genome approach to accelerating materials innovation. APL Mater. 1, 011002 (2013).
Saal, J. E., Kirklin, S., Aykol, M., Meredig, B. & Wolverton, C. Materials design and discovery with high-throughput density functional theory: the Open Quantum Materials Database (OQMD). Jom 65, 1501–1509 (2013).
Article CAS Google Scholar
Kirklin, S. et al. The Open Quantum Materials Database (OQMD): assessing the accuracy of DFT formation energies. Npj Comput. Mater. 1, 1–15 (2015).
Article Google Scholar
Curtarolo, S. et al. AFLOW: an automatic framework for high-throughput materials discovery. Comput. Mater. Sci. 58, 218–226 (2012).
Article CAS Google Scholar
Choudhary, K. et al. The joint automated repository for various integrated simulations (JARVIS) for data-driven materials design. Npj Comput. Mater. 6, 173 (2020).
Article Google Scholar
Stuke, A. et al. Chemical diversity in molecular orbital energy predictions with kernel ridge regression. J. Chem. Phys. 150, 204121 (2019).
Takahashi, K. & Takahashi, L. Creating machine learning-driven material recipes based on crystal structure. J. Phys. Chem. Lett. 10, 283–288 (2019).
Article CAS PubMed Google Scholar
Oliynyk, A. O., Adutwum, L. A., Harynuk, J. J. & Mar, A. Classifying crystal structures of binary compounds AB through cluster resolution feature selection and support vector machine analysis. Chem. Mater. 28, 6672–6681 (2016).
Article CAS Google Scholar
Torrisi, S. B. et al. Random forest machine learning models for interpretable X-ray absorption near-edge structure spectrum-property relationships. Npj Comput. Mater. 6, 109 (2020).
Article Google Scholar
Ren, Q. et al. Machine-learning-assisted discovery of 212-Zintl-phase compounds with ultra-low lattice thermal conductivity. J. Mater. Chem. A 12, 1157–1165 (2024).
Article CAS Google Scholar
Bharadwaj, N., Manna, S. S., Jena, M. K., Roy, D. & Pathak, B. Unlocking the efficiency of nonaqueous Li-air batteries through the synergistic effect of dual metal site catalysts: an interpretable machine learning approach. J. Mater. Chem. A 12, 15115–15126 (2024).
Zou, X. et al. Unsupervised learning-guided accelerated discovery of alkaline anion exchange membranes for fuel cells. Angew. Chem. Int. Ed. 135, e202300388 (2023).
Article Google Scholar
Jia, X. et al. Unsupervised machine learning for discovery of promising half-heusler thermoelectric materials. Npj Comput. Mater. 8, 34 (2022).
Article CAS Google Scholar
Wang, Z., Cai, J., Wang, Q., Wu, S. & Li, J. Unsupervised discovery of thin-film photovoltaic materials from unlabeled data. Npj Comput. Mater. 7, 128 (2021).
Article CAS Google Scholar
Sootsman, J. R., Chung, D. Y. & Kanatzidis, M. G. New and old concepts in thermoelectric materials. Angew. Chem. Int. Ed. 48, 8616–8639 (2009).
Article CAS Google Scholar
Pernot, G. et al. Precise control of thermal conductivity at the nanoscale through individual phonon-scattering barriers. Nat. Mater. 9, 491–495 (2010).
Article CAS PubMed Google Scholar
Padture, N. P., Gell, M. & Jordan, E. H. Thermal barrier coatings for gas-turbine engine applications. Science 296, 280–284 (2002).
Article CAS PubMed Google Scholar
Gao, Z., Tao, F. & Ren, J. Unusually low thermal conductivity of atomically thin 2D tellurium. Nanoscale 10, 12997–13003 (2018).
Article CAS PubMed Google Scholar
Wang, Y. & Ren, J. Strain-driven switchable thermal conductivity in ferroelastic PdSe₂. ACS Appl. Mater. Interfaces 13, 34724–34731 (2021).
Article CAS PubMed Google Scholar
Xiang, C., Wu, C.-W., Zhou, W.-X., Xie, G. & Zhang, G. Thermal transport in lithium-ion battery: a micro perspective for thermal management. Front. Phys. 17, 1–11 (2022).
Article Google Scholar
Snyder, G. J. & Toberer, E. S. Complex thermoelectric materials. Nat. Mater. 7, 105–114 (2008).
Article CAS PubMed Google Scholar
Nolas, G., Morelli, D. & Tritt, T. M. Skutterudites: a phonon-glass-electron crystal approach to advanced thermoelectric energy conversion applications. Annu. Rev. Mater. Sci. 29, 89–116 (1999).
Article CAS Google Scholar
Snyder, G. J., Christensen, M., Nishibori, E., Caillat, T. & Iversen, B. B. Disordered zinc in Zn₄Sb₃ with phonon-glass and electron-crystal thermoelectric properties. Nat. Mater. 3, 458–463 (2004).
Article CAS PubMed Google Scholar
Mukhopadhyay, S. et al. Two-channel model for ultralow thermal conductivity of crystalline Tl₃VSe₄. Science 360, 1455–1458 (2018).
Article CAS PubMed Google Scholar
Jana, M. K. et al. Intrinsic rattler-induced low thermal conductivity in Zintl type TlInTe₂. J. Am. Chem. Soc. 139, 4350–4353 (2017).
Article CAS PubMed Google Scholar
Jana, M. K., Pal, K., Waghmare, U. V. & Biswas, K. The origin of ultralow thermal conductivity in InTe: Lone-pair-induced anharmonic rattling. Angew. Chem. Int. Ed. 55, 7792–7796 (2016).
Article CAS Google Scholar
Lin, H. et al. Concerted rattling in CsAg₅Te₃ leading to ultralow thermal conductivity and high thermoelectric performance. Angew. Chem. Int. Ed. 55, 11431–11436 (2016).
Article CAS Google Scholar
Nielsen, M. D., Ozolins, V. & Heremans, J. P. Lone pair electrons minimize lattice thermal conductivity. Energy Environ. Sci. 6, 570–578 (2013).
Article CAS Google Scholar
Ward, L., Agrawal, A., Choudhary, A. & Wolverton, C. A general-purpose machine learning framework for predicting properties of inorganic materials. Npj Comput. Mater. 2, 1–7 (2016).
Article Google Scholar
Ward, L. et al. Including crystal structure attributes in machine learning models of formation energies via Voronoi tessellations. Phys. Rev. B 96, 024104 (2017).
Article Google Scholar
Abdi, H. & Williams, L. J. Principal component analysis. Wiley Interdiscip. Rev. Comput. Stat. 2, 433–459 (2010).
Article Google Scholar
Hartigan, J. A. & Wong, M. A. A k-means clustering algorithm. J. R. Stat. Soc. Ser. C. Appl. Stat. 28, 100–108 (1979).
Likas, A., Vlassis, N. & Verbeek, J. J. The global k-means clustering algorithm. Pattern Recognit. 36, 451–461 (2003).
Van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
Yan, S., Wang, Y., Tao, F. & Ren, J. High-throughput estimation of phonon thermal conductivity from first-principles calculations of elasticity. J. Phys. Chem. A 126, 8771–8780 (2022).
Article CAS PubMed Google Scholar
Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V. & Gulin, A. Catboost: unbiased boosting with categorical features. In Adv. Neural Inf. Process. Syst. 31, 6639−6649 (2018).
Shapley, L. S. A value for n-person games. Contributions Theory Games II Ann. Math. Stud. 28, 307–317 (1953).
Google Scholar
Lundberg, S., Lee, S., Guyon, I., Luxburg, U. & Bengio, S. A unified approach to interpreting model predictions. In Adv. Neural Inf. Process. Syst. 30, 4768–4777 (2017).
Lundberg, S. M. et al. From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell. 2, 56–67 (2020).
Article PubMed PubMed Central Google Scholar
Jaafreh, R., Kang, Y. S. & Hamad, K. Lattice thermal conductivity: an accelerated discovery guided by machine learning. ACS Appl. Mater. Interfaces 13, 57204–57213 (2021).
Article CAS PubMed Google Scholar
Kodinariya, T. M. & Makwana, P. R. et al. Review on determining number of cluster in k-means clustering. Int. J. Adv. Res. Comput. Sci. Manag. Stud. 1, 90–95 (2013).
Cui, M. et al. Introduction to the k-means clustering algorithm based on the elbow method. Acc. Audit. Financ. 1, 5–8 (2020).
Shahapure, K. R. & Nicholas, C. Cluster quality analysis using silhouette score. In 2020 IEEE 7th international conference on data science and advanced analytics (DSAA), 747–748 (IEEE, 2020).
Ewbank, M. D., Newman, P. R. & Kuwamoto, H. Thermal conductivity and specific heat of the chalcogenide salt: Tl₃AsSe₃. J. Appl. Phys. 53, 6450–6452 (1982).
Article CAS Google Scholar
Spitzer, D. Lattice thermal conductivity of semiconductors: a chemical bond approach. J. Phys. Chem. Solids 31, 19–40 (1970).
Article CAS Google Scholar
Skoug, E. J. & Morelli, D. T. Role of lone-pair electrons in producing minimum thermal conductivity in nitrogen-group chalcogenide compounds. Phys. Rev. Lett. 107, 235901 (2011).
Article PubMed Google Scholar
Popovich, N. & Shura, V. Electrical and thermal transport properties of (TlBiS₂)_1-x(2PbS)_x alloys. J. Phys. 15, 5389 (2003).
CAS Google Scholar
Haynes, W. M. CRC Handbook of Chemistry and Physics (CRC Press, 2016).
Beasley, J. D. Thermal conductivities of some novel nonlinear optical materials. Appl. Opt. 33, 1000–1003 (1994).
Article CAS PubMed Google Scholar
Slack, G. A., Tanzilli, R. A., Pohl, R. & Vandersande, J. The intrinsic thermal conductivity of AlN. J. Phys. Chem. Solids 48, 641–647 (1987).
Article CAS Google Scholar
Morelli, D. T. & Slack, G. A. High lattice thermal conductivity solids. In High thermal conductivity materials, 37–68 (Springer, 2006).
Slack, G. A. Nonmetallic crystals with high thermal conductivity. J. Phys. Chem. Solids 34, 321–335 (1973).
Article CAS Google Scholar
Shindé, S. L. & Goela, J. High Thermal Conductivity Materials, vol. 91 (Springer, 2006).
Glassbrenner, C. J. & Slack, G. A. Thermal conductivity of silicon and germanium from 3 K to the melting point. Phys. Rev. 134, A1058 (1964).
Article Google Scholar
MongoDB. https://www.mongodb.com/.
Cao, Y. et al. High-throughput screening of potentially ductile and low thermal conductivity ABX₃ (X= S, Se, Te) thermoelectric perovskites. Appl. Phys. Lett. 124, 092101 (2024).
Li, R. et al. High-throughput screening for advanced thermoelectric materials: diamond-like ABX₂ compounds. ACS Appl. Mater. Interfaces 11, 24859–24866 (2019).
Article CAS PubMed Google Scholar
Guillemot, S. K., Suwardi, A., Kaltsoyannis, N. & Skelton, J. M. Impact of crystal structure on the lattice thermal conductivity of the IV–VI chalcogenides. J. Mater. Chem. A 12, 2932–2948 (2024).
Article CAS Google Scholar
Deng, T. et al. Electronic transport descriptors for the rapid screening of thermoelectric materials. Mater. Horiz. 8, 2463–2474 (2021).
Article CAS PubMed Google Scholar
Plata, J. J., Posligua, V., Márquez, A. M., Fernandez Sanz, J. & Grau-Crespo, R. Charting the lattice thermal conductivities of I–III–VI₂ chalcopyrite semiconductors. Chem. Mater. 34, 2833–2841 (2022).
Article CAS Google Scholar
Posligua, V., Plata, J. J., Márquez, A. M., Sanz, J. F. & Grau-Crespo, R. Theoretical investigation of the lattice thermal conductivities of II–IV–V₂ pnictide semiconductors. ACS Appl. Electron. Mater. 6, 2951–2959 (2023).
Article PubMed PubMed Central Google Scholar
Pei, Y. et al. Convergence of electronic bands for high performance bulk thermoelectrics. Nature 473, 66–69 (2011).
Article CAS PubMed Google Scholar
Heremans, J. P. The anharmonicity blacksmith. Nat. Phys. 11, 990–991 (2015).
Article CAS Google Scholar
Zeng, X. et al. Physical insights on the thermoelectric performance of Cs₂SnBr₆ with ultralow lattice thermal conductivity. J. Phys. Chem. Lett. 13, 9736–9744 (2022).
Article CAS PubMed Google Scholar
Chang, C. & Zhao, L.-D. Anharmoncity and low thermal conductivity in thermoelectrics. Mater. Today Phys. 4, 50–57 (2018).
Article Google Scholar
Li, J., Hu, W. & Yang, J. High-throughput screening of rattling-induced ultralow lattice thermal conductivity in semiconductors. J. Am. Chem. Soc. 144, 4448–4456 (2022).
Article CAS PubMed Google Scholar
Kawano, S., Tadano, T. & Iikubo, S. Effect of halogen ions on the low thermal conductivity of cesium halide perovskite. J. Phys. Chem. C 125, 91–97 (2021).
Article CAS Google Scholar
Dronskowski, R. & Blöchl, P. E. Crystal orbital Hamilton populations (COHP): energy-resolved visualization of chemical bonding in solids based on density-functional calculations. J. Phys. Chem. 97, 8617–8624 (1993).
Article CAS Google Scholar
Safavian, S. R. & Landgrebe, D. A survey of decision tree classifier methodology. IEEE Trans. Syst., man, Cybern. 21, 660–674 (1991).
Article Google Scholar
Geurts, P., Ernst, D. & Wehenkel, L. Extremely randomized trees. Mach. Learn. 63, 3–42 (2006).
Article Google Scholar
Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
Article Google Scholar
Natekin, A. & Knoll, A. Gradient boosting machines, a tutorial. Front. Neurorobot. 7, 21 (2013).
Article PubMed PubMed Central Google Scholar
Freund, Y. & Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55, 119–139 (1997).
Article Google Scholar
Chen, T. & Guestrin, C. XGBoost: a scalable tree boosting system. In Proceedings of the 22nd ACM Sigkdd International Conference on Knowledge Discovery and Data Mining, 785–794 (2016).
Ke, G. et al. LightGBM: a highly efficient gradient boosting decision tree. In Adv. Neural Inf. Process. Syst. 30, 3149–3157 (2017).
Kresse, G. & Joubert, D. From ultrasoft pseudopotentials to the projector augmented-wave method. Phys. Rev. B 59, 1758–1775 (1999).
Article CAS Google Scholar
Kresse, G. & Furthmüller, J. Efficient iterative schemes for ab initio total-energy calculations using a plane-wave basis set. Phys. Rev. B 54, 11169 (1996).
Wang, V., Xu, N., Liu, J.-C., Tang, G. & Geng, W.-T. VASPKIT: a user-friendly interface facilitating high-throughput computing and analysis using VASP code. Comput. Phys. Commun. 267, 108033 (2021).
Article CAS Google Scholar
Perdew, J. P., Burke, K. & Ernzerhof, M. Generalized gradient approximation made simple. Phys. Rev. Lett. 77, 3865 (1996).
Article CAS PubMed Google Scholar
Ong, S. P. et al. Python Materials Genomics (pymatgen): a robust, open-source python library for materials analysis. Comput. Mater. Sci. 68, 314–319 (2013).
Article CAS Google Scholar
Baroni, S., De Gironcoli, S., Dal Corso, A. & Giannozzi, P. Phonons and related crystal properties from density-functional perturbation theory. Rev. Mod. Phys. 73, 515 (2001).
Article CAS Google Scholar
Togo, A., Chaput, L., Tadano, T. & Tanaka, I. Implementation strategies in phonopy and phono3py. J. Phys. Condens. Matter 35, 353001 (2023).
Article CAS Google Scholar
Togo, A. First-principles phonon calculations with phonopy and phono3py. J. Phys. Soc. Jpn 92, 012001 (2023).
Article Google Scholar
Li, W., Lindsay, L., Broido, D. A., Stewart, D. A. & Mingo, N. Thermal conductivity of bulk and nanowire Mg₂Si_xSn_1-x alloys from first principles. Phys. Rev. B 86, 174307 (2012).
Article Google Scholar
Li, W., Carrete, J., Katcho, N. A. & Mingo, N. ShengBTE: a solver of the Boltzmann transport equation for phonons. Comp. Phys. Commun. 185, 1747–1758 (2014).
Article CAS Google Scholar
Horton, M. et al. Crystal Toolkit: a web app framework to improve usability and accessibility of materials science research algorithms. Preprint at arXiv https://doi.org/10.48550/arXiv.2302.06147 (2023).
Momma, K. & Izumi, F. VESTA 3 for three-dimensional visualization of crystal, volumetric and morphology data. J. Appl. Crystallogr. 44, 1272–1276 (2011).
Article CAS Google Scholar
Fugallo, G., Lazzeri, M., Paulatto, L. & Mauri, F. Ab initio variational approach for evaluating lattice thermal conductivity. Phys. Rev. B 88, 045430 (2013).
Ward, L. et al. Matminer: an open source toolkit for materials data mining. Comput. Mater. Sci. 152, 60–69 (2018).
Article Google Scholar
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Google Scholar
Akiba, T., Sano, S., Yanase, T., Ohta, T. & Koyama, M. Optuna: a next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2623–2631 (2019).

Download references

Acknowledgements

We acknowledge the support from the National Natural Science Foundation of China (No. 11935010), the National Key R&D Program of China (No. 2023YFA1406900 and No. 2022YFA1404400), the Natural Science Foundation of Shanghai (No. 23ZR1481200), the Program of Shanghai Academic Research Leader (No. 23XD1423800), and the Opening Project of Shanghai Key Laboratory of Special Artificial Microstructure Materials and Technology.

Author information

Authors and Affiliations

Shanghai Research Institute for Intelligent Autonomous Systems, and School of Physics Science and Engineering, Tongji University, Shanghai, 200092, China
Mengfan Wu, Shenshen Yan & Jie Ren
Center for Phononics and Thermal Energy Science, Tongji University, Shanghai, 200092, China
Mengfan Wu, Shenshen Yan & Jie Ren

Authors

Mengfan Wu
View author publications
Search author on:PubMed Google Scholar
Shenshen Yan
View author publications
Search author on:PubMed Google Scholar
Jie Ren
View author publications
Search author on:PubMed Google Scholar

Contributions

M.W. and J.R. conceived and designed this research project. M.W. developed the overall framework, conducted the theoretical analysis, performed data visualization, and prepared the manuscript under the supervision of J.R. The high-throughput calculations were carried out with the assistance of S.Y. All authors discussed the results. J.R. provided critical revisions to the manuscript and contributed to project administration.

Corresponding author

Correspondence to Jie Ren.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Wu, M., Yan, S. & Ren, J. Hierarchy-boosted funnel learning for identifying semiconductors with ultralow lattice thermal conductivity. npj Comput Mater 11, 106 (2025). https://doi.org/10.1038/s41524-025-01583-9

Download citation

Received: 06 January 2025
Accepted: 14 March 2025
Published: 22 April 2025
DOI: https://doi.org/10.1038/s41524-025-01583-9