Abstract
Partial differential equations (PDEs) are fundamental for modeling complex physical processes, often exhibiting structural features such as symmetries and conservation laws. While physics-informed neural networks (PINNs) can simulate and invert PDEs, they mainly rely on external loss functions for physical constraints, making it difficult to automatically discover and embed physically consistent network structures. We propose a physics structure-informed neural network discovery method based on physics-informed distillation, which decouples physical and parameter regularization via staged optimization in teacher and student networks. After distillation, clustering and parameter reconstruction are used to extract and embed physically meaningful structures. Numerical experiments on Laplace, Burgers, and Poisson equations, as well as fluid mechanics, show that the method can automatically extract relevant structures, improve accuracy and training efficiency, and enhance structural adaptability and transferability. This approach offers a new perspective for efficient modeling and automatic discovery of structured neural networks.
Introduction
The evolution of complex systems is fundamentally governed by high-dimensional, nonlinear, and multiscale partial differential equations (PDEs), which arise in diverse fields such as geosciences, materials science, fluid dynamics, and biological systems1. Numerical methods such as finite element2 and meshless approaches3 have advanced physical modeling over the past decades, but traditional PDE-solving paradigms are increasingly limited in theory and application when confronted with high-dimensional parameter spaces, extreme computational costs4, the high trial-and-error cost of inverse problems5, and practical constraints such as missing boundary or initial conditions. These methods are also limited in their ability to exploit the underlying structural features of the system. Therefore, there is an urgent need for modeling frameworks that integrate data and physics and possess strong applicability to overcome these bottlenecks.
In recent years, machine learning (ML) methods, especially deep learning methods6, have provided new perspectives for scientific modeling. ML can automatically learn couplings and nonlinear mappings between variables from observational data, demonstrating great potential in system identification, equation discovery7, and complex system modeling8. However, conventional end-to-end deep learning approaches require large amounts of data and are prone to overfitting, which limits their applicability1. To address these issues, physics-informed neural networks (PINNs)9 have emerged, embedding physical constraints into neural networks (NNs) via loss functions and thus integrating data-driven learning with physical priors. PINNs offer scalability10, flexibility, and mesh-free characteristics11, making them powerful tools for solving PDEs and showing promise in inverse problems12. However, most existing PINN approaches focus on enforcing physical constraints through loss functions, while largely ignoring the explicit structural characteristics of the underlying physical systems. Furthermore, external loss functions only minimize the average inconsistency between model predictions and the physical mechanism13. As a result, for problems requiring strong physical consistency, one limitation of PINNs is that their predictions do not strictly adhere to the underlying physical conservation laws14.
It is practical to enhance model performance by adding more domain-specific constraints15,16,17. Some methods impose constraints externally to the network, such as Hamiltonian neural networks18, Lagrangian neural networks19, and symplectic networks20, which maintain conservation laws by learning energy-like scalar quantities. Automatic symmetry discovery methods21 encourage networks to preserve symmetries by defining generalized symmetry loss functions. In contrast, another class of effective approaches embeds constraints directly within the internal structure of the network22, thereby ensuring strict satisfaction of these constraints. For example, Rao et al.23 encoded constraints within the network to optimize solutions for reaction-diffusion equations. The underlying principle is that adjusting a system’s internal structure can modify its output to satisfy required constraints24. Furthermore, if the network connectivity itself is designed to match the nature of the problem, the network can achieve superior performance in specialized domains24. For instance, Zhu et al.11 enforced relationships among trainable parameters to embed the space-time parity symmetry (ST-symmetry) of the Ablowitz-Ladik (A-L) equation into the network’s weight arrangement, enabling the simulation of nonlinear dynamic lattice solutions. However, these parameter-structured data-driven models rely on manual construction based on strong prior knowledge for specific problems, resulting in limited structural patterns and restricted applicability. Therefore, developing algorithms for automatic identification and extraction of network structures–reducing dependence on prior knowledge and manual design–would significantly enhance the applicability of structured network data-driven methods.
One effective approach for extracting network structures in NN techniques is regularization25,26. However, applying regularization in PINNs often fails to yield satisfactory results, as shown in Fig. 1. This is not only because PINNs are insensitive to regularization terms in the loss function27, but also because parameter regularization introduces gradient optimization directions that may conflict with the existing physical constraint regularization in PINNs9. Such excessive constraints can actually degrade the accuracy of PINNs1. Distillation learning28, as an effective dual-model training scheme, allows two models to be trained with different loss terms, enabling the student model to achieve accuracy comparable to the teacher model. This provides a means to integrate both physical and parameter regularization.
Inspired by knowledge distillation, we propose a physics structure-informed NN discovery framework (physics structure-informed neural network, pronounced Psi-NN and abbreviated Ψ-NN). In Ψ-NN, physical regularization (from governing equations) and parameter regularization are decoupled and applied separately to the teacher and student networks, overcoming the insensitivity to regularization and the potential performance degradation observed in conventional PINNs. Physical information is efficiently transferred from the teacher to the student network via distillation, preserving essential physical constraints while expanding the representational capacity of the student model. An optimized structure extraction algorithm then automatically identifies parameter matrices with physical significance, while maximally retaining the feature space of the student network. Finally, a reinitialization mechanism is employed for network reconstruction, ensuring physical consistency of the network structure while keeping the model adaptable to new problems. By organically integrating distillation, structure extraction, and network reconstruction, Ψ-NN achieves physical consistency, interpretability, and high-accuracy predictions in structured NNs.
Results
Physics structure-informed neural network
To achieve an automatic, physics-informed, and interpretable NN structure extraction mechanism, the Ψ-NN method consists of three components: (A) physics-informed distillation; (B) network parameter matrix extraction; and (C) structured network reconstruction, as illustrated in Fig. 2. The core idea of Ψ-NN is to embed physical information–such as spatiotemporal symmetries and conservation laws–directly into the network architecture. These constraints are encoded by the parameter matrices and reconstructed within the new network structure, thereby endowing the network with physical relevance. Further details are provided in Section 3. An ablation study of the Ψ-NN method is presented in Supplementary Information, validating the necessity of each component.
The teacher network predicts the computational domain using the PINN approach, while the student network is supervised by the teacher’s output, forming a distillation learning process. During the training of the student network, regularization is used to naturally drive the parameters into clusters that can be identified by a clustering algorithm under the current physical constraints. Finally, based on the clustering results, parameter matrices related to physical properties are extracted, and ultimately the network structure is reconstructed through structure-embedding (embedding the unchanged relation matrix R into a new network).
In the numerical experiments, Ψ-NN achieves the goal of extracting network structures from data under partially known physical laws. The case studies demonstrate that Ψ-NN (A) accurately solves specific physical problems; (B) generalizes across different control parameters within the same problem; and (C) maintains generalizability of the reconstructed network structure across different physical problems. The detailed case settings are provided in Supplementary Information. The results show that Ψ-NN can effectively extract high-performance network structures in problems with partially known laws, yielding good fitting accuracy within the problem domain. The overall workflow is illustrated in Fig. 3, and the error results are summarized in Table 1. Control parameter transfer refers to the case where the form of the PDE is fixed during the inverse problem, but certain parameters (such as the viscosity coefficient in the Burgers equation shown in the figure) are varied.
Using the Burgers equation as an example. Three bold arrows represent the distillation-extraction-reconstruction process. a The whole-field data are predicted by the data-driven model. b The student network is trained with the structure-generating method, using the teacher model’s output. c The structure extraction and reconstruction method is applied to find an optimal network structure for modeling the Burgers equation.
Extraction of network structure from PDEs
We selected several representative PDEs–the Laplace, Burgers, and Poisson equations–for which baseline models with prior hard constraints are available, thereby better demonstrating the generalizability of Ψ-NN. These problems are widely used in physics. In the control group, we use PINNs with post-processing hard mapping13,22 (PINN-post) as well as standard PINNs9. The former introduces additional enforced constraints by post-processing the network outputs, while the latter serves as a general NN solver for PDEs. The case studies were run on a machine with an Intel 12400F CPU and an RTX 4080 GPU. In all cases, the Adam optimizer29 is used for training. To ensure reproducibility, the random seed is fixed at 1234. The computational results are shown in Fig. 4.
a Laplace equation results. b Burgers equation results. Black dots in the exact field represent sampling points. c Poisson equation results. The first column on the left shows the mean square error (MSE) propagation, which is used to compare the optimization speed (the decrease in error within the same number of steps) and optimization accuracy (the final MSE value) of the representative models; the second column on the left shows the ground truth of the cases. The three columns on the right are the results of PINN, PINN-post, and Ψ-NN, respectively. The first row shows the model predictions, and the second row shows the absolute error between the model and the ground truth.
Laplace equation
Laplace’s equation has applications in various fields, including electric fields, heat conduction, and fluid statics30. With appropriate boundary and initial conditions, this equation can exhibit clear symmetry properties, providing more distinct structural features for the network. The Laplace equation is used to fully illustrate the implementation process of Ψ-NN and to demonstrate the interpretability of the Ψ-NN structure.
Consider the steady-state Laplace equation in two-dimensional coordinates \({{{\boldsymbol{x}}}}\in {{\mathbb{R}}}^{2}\) with the following governing PDE:

\[{\nabla }^{2}u=\frac{{\partial }^{2}u}{\partial {x}_{1}^{2}}+\frac{{\partial }^{2}u}{\partial {x}_{2}^{2}}=0,\]

where x = (x1, x2). The boundary conditions of the problem are:
other settings are provided in Supplementary Information.
A. Extracted structure
The Ψ-NN method enables clear extraction of network structures under the guidance of physical laws, whereas other existing methods can negatively impact network accuracy, as detailed in Supplementary Information. Figure 5a shows the evolution of the first hidden layer parameters during training. As the student network loss stabilizes, parameter convergence becomes more pronounced, resulting in extractable network structures. This convergence phenomenon is observed across different layers, and the final parameter clustering results under Ψ-NN are shown in Fig. 5b. The clustering of biases also converges and approaches zero, reducing inter-layer bias features and making the symmetry more evident.
a The evolution of parameters under the Ψ-NN method. The loss curve here serves as an indicator of residual stability rather than final accuracy. b Cluster centers of Ψ-NN for the Laplace equation. The x-axis represents the absolute value of the weights, and the y-values are assigned randomly to better visualize the distribution. Negative values are shown as red dots and positive values as blue dots. The cluster distance is set to 0.1. The right column shows the distributions of biases and the left column the distributions of weights. The first to fourth rows correspond to the clustering results of the network parameters for the first to third hidden layers and the output layer, respectively. c The structure of the student NN after parameter replacement for the Laplace equation.
Figure 5c shows the network structure after replacing the original parameters with the cluster centers.
Since the second hidden layer structure involves reuse, sign reversal, and swapping, we take it as an example to describe in detail the formation of the relation matrix R2 during the “structure extraction” process and its role in the “network reconstruction” process of Ψ-NN. The subscript indicates the second layer. First, after replacing the trainable parameters of the student network with cluster centers, the parameter matrix c2 is:
parameter matrix c2 is constructed as a diagonal matrix with different cluster center parameters arranged on the diagonal and denoted by superscripts as:
After flattening cS2, selecting cluster centers using one-hot vectors, and incorporating sign relationships, the relation matrix R2 is represented as:
In this matrix, rows with the same parameters are duplicated, rows with opposite signs are negated, and the swapping of rows 1,2 and 4,3 (which includes both row swapping and sign reversal in this case) represents the swapping relationship of parameters. The relation matrix R2 stores the relationships between network parameters, with each row representing a selected cluster center. Thus, in the reconstruction process of the new network, the trainable parameter matrix of the second hidden layer \({\hat{{{{\boldsymbol{\theta }}}}}}_{2}\) is constrained by R2 as follows:
After arranging the selected non-zero trainable parameters according to the node order, the following is obtained:
converting to matrix form, W2 is as follows:
The other layers follow similarly. This structure is further represented using low-rank parameter matrices, with the weight matrices denoted as \({{{{\boldsymbol{W}}}}}_{i}^{j}\), where i indicates the layer and j labels identical submatrices. The bias is denoted as b. The architecture is expressed as follows:
This low-rank matrix is reconstructed through structure-embedding (embedding the unchanged relation matrix R into a new network), where the parameters of d (the parameter submatrix dimension) nodes are represented by \({{{{\boldsymbol{W}}}}}_{1}^{a}={[{w}_{11}^{a},{w}_{12}^{a},\cdots,{w}_{1d}^{a}]}^{T}\).
By expanding the expressions of the final hidden layer’s two sets of nodes, the relationship between the two-dimensional input x = (x1, x2) and the node outputs ua and ub can be obtained:
From the expressions of both, it follows that:
finally, the output expression is:
where \({{{{\boldsymbol{W}}}}}_{3}^{a}\) is the parameter matrix. This expression can also be written as:
Therefore, this structured method implicitly embeds the symmetry contained in PINN-post.
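As a schematic illustration (with generic entries, not the specific matrices extracted in this case), a relation matrix R with elements in {−1, 0, 1} maps a short vector of free cluster-center parameters to the full set of tied weights; identical rows encode parameter sharing, negated rows encode sign reversal, and permuted rows encode parameter swapping:

\[\hat{{{{\boldsymbol{\theta }}}}}={{{\boldsymbol{R}}}}\,{{{\boldsymbol{\theta }}}}_{{{\rm{free}}}}=\left[\begin{array}{rr}1&0\\ -1&0\\ 0&-1\\ 0&1\end{array}\right]\left[\begin{array}{c}{\theta }_{1}\\ {\theta }_{2}\end{array}\right]=\left[\begin{array}{c}{\theta }_{1}\\ -{\theta }_{1}\\ -{\theta }_{2}\\ {\theta }_{2}\end{array}\right].\]

Only θ1 and θ2 remain trainable after reconstruction; all tied entries then update together through R.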
Previous structured PINN approaches11,31 require symmetry priors and rely on manually imposing group equivariance, targeting specific problems, while the network structure in Ψ-NN is automatically extracted from data and physical constraints, reducing reliance on manual design and strong prior knowledge.
B. Comparison between Ψ-NN and PINN
The results are shown in Fig. 4a, and the full-field L2 errors are summarized in Table 1. Compared to PINN, Ψ-NN reduces the number of iterations required to reach the same loss magnitude (1e−3) by approximately 50%, and decreases the final L2 error by about 95%. As illustrated in Fig. 6a and d, PINN does not exhibit consistent symmetry during training, whereas the structural constraints in Ψ-NN, which are consistent with the features of the PDE, enable a reduced search space and allow the solution to be found more quickly and accurately in the early stages of training.
a, d PINN predictions; b, e PINN-post predictions; c, f Ψ-NN predictions, illustrating the optimization tendency of the different models. In the first row, the red dashed line represents the true value of u as x2 varies at x1 = 0.8. Each green-blue gradient curve shows the network output at training steps from 1000 to 5000, at intervals of 300. In the second row, the results at several training steps are plotted on a two-dimensional coordinate plane to illustrate the symmetry breaking in PINN during the iterative process, as well as the symmetry preservation in PINN-post and Ψ-NN.
C. Comparison between Ψ-NN and PINN-post
The PINN-post incorporates spatial symmetry into the network output layer through explicit constraints, resulting in outputs that better satisfy symmetric physical properties–reducing the full-field L2 error by approximately 65% compared to conventional PINN, especially within the computational domain. However, the converged MSE of PINN-post is higher than that of Ψ-NN, indicating that the minimum loss value in the PINN framework does not fully reflect the true accuracy of the network, but only the average fit to the available data and PDEs. The rate of loss reduction reflects the convergence speed: to reach a loss of 1e-3, PINN-post requires 5e3 fewer iterations than PINN.
Both Ψ-NN and PINN-post embed spatial symmetry into the network, but the key difference is that the reconstructed Ψ-NN architecture inherently contains this physical property, whereas PINN-post applies hard mapping as a post-processing step at the output layer. In terms of computation time, PINN-post requires 32.68 minutes, longer than Ψ-NN’s 29.87 minutes. Furthermore, Ψ-NN reaches a loss of 1e−4 with 1.5e4 fewer iterations, and its average convergence speed is about twice that of PINN-post. The L2 error is reduced from 1.159e−3 to 7.422e−5. As shown in Fig. 4a, the hard mapping constraint in PINN-post does not reduce the large computational errors near the boundaries. This suggests that, due to the rich implicit constraints in physical fields, manually embedding features via post-processing provides only a necessary but not sufficient constraint, and its applicability may be limited to a certain range. In contrast, Ψ-NN discovers network structures entirely from observational data and PDEs, aiming to automatically embed all known information about the physical problem into the network structure, thereby reducing errors over a broader computational domain.
Burgers equation
The Burgers equation, as an important tool for describing nonlinear wave phenomena, is frequently used to study complex systems such as fluid dynamics and wave behavior32. We select the Burgers equation because of its pronounced shock formation and resulting antisymmetric properties, which serve to validate (a) the performance advantages of structured networks and (b) their applicability across a wide range of parameter variations.
In the inverse problem, the viscosity coefficient in the Burgers equation is replaced by an unknown parameter λ1, and the governing PDE becomes:

\[\frac{\partial u}{\partial t}+u\frac{\partial u}{\partial x}-{\lambda }_{1}\frac{{\partial }^{2}u}{\partial {x}^{2}}=0.\]
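For concreteness, a minimal PyTorch-style sketch of how an inverse-problem residual of this form is typically assembled, with λ1 as a trainable parameter (illustrative only; the network size, sampling, and loss weights used in the paper are given in the Supplementary Information):

```python
import torch

# Trainable unknown viscosity coefficient (initial guess is arbitrary).
lam1 = torch.nn.Parameter(torch.tensor(0.1))

# Stand-in surrogate u(t, x); in Psi-NN this would be the reconstructed
# structured network rather than a plain MLP.
net = torch.nn.Sequential(
    torch.nn.Linear(2, 20), torch.nn.Tanh(), torch.nn.Linear(20, 1)
)

def burgers_residual(t, x):
    t, x = t.requires_grad_(True), x.requires_grad_(True)
    u = net(torch.stack([t, x], dim=-1)).squeeze(-1)
    u_t = torch.autograd.grad(u.sum(), t, create_graph=True)[0]
    u_x = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    u_xx = torch.autograd.grad(u_x.sum(), x, create_graph=True)[0]
    # Residual of u_t + u*u_x - lam1*u_xx = 0; lam1 is optimized jointly with
    # the network weights, e.g. Adam over list(net.parameters()) + [lam1].
    return u_t + u * u_x - lam1 * u_xx
```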
A. Extraction result and performance comparison
In the Ψ-NN extraction process, taking the third hidden layer as an example, the parameter variations are shown in Supplementary Information (Fig. 6). The extracted low-rank parameter matrices are:
Similarly, the relationship between the two-dimensional input x = (x1, x2) and the node outputs ua, ub, uc of the final hidden layer can be obtained:
From the expressions of the three, it follows that:
The expression of the final output u(x1, x2) is:
where \({{{{\boldsymbol{W}}}}}_{3}^{a},{{{{\boldsymbol{W}}}}}_{3}^{b}\) are parameter matrices.
Therefore, the first half of u, \({{{{\boldsymbol{W}}}}}_{3}^{a}\cdot {{{{\boldsymbol{u}}}}}_{a}({x}_{1},{x}_{2})\), contains the symmetry of PINN-post in the limit \({b}_{1}^{a}\to 0\), while the second half, \({{{{\boldsymbol{W}}}}}_{3}^{b}\cdot ({{{{\boldsymbol{u}}}}}_{b}({x}_{1},{x}_{2})+{{{{\boldsymbol{u}}}}}_{c}({x}_{1},{x}_{2}))\), directly contains the symmetry of PINN-post. Additionally, the properties of ua in the second hidden layer indicate that the emergence of symmetry does not strictly depend on the number of network layers.
The trend of the loss function is shown in Fig. 4b. The reconstructed structured network of Ψ-NN iterates significantly faster during training than PINN and PINN-post, reaching a minimum loss of 1e−5, lower than that of the other two models.
The full-field L2 errors are summarized in Table 1. Both PINN-post and Ψ-NN achieve lower errors than PINN, demonstrating the effectiveness of embedding equation features into the network structure for improving fitting accuracy. As shown in Fig. 4b, after shock formation at t > 0.4, both PINN and PINN-post exhibit large error distributions, whereas Ψ-NN uniformly reduces errors on both sides of the shock. Furthermore, PINN-post, due to its post-processing symmetry, enforces a symmetric error distribution across the shock but does not reduce the actual error. The structure discovery capability of Ψ-NN provides a more precise and matching feature space, resulting in the lowest error.
For the inverse problem, in addition to reconstructing the entire field, another key task is to estimate the unknown parameter λ1 in Eq. (17), with the true value λ1 = 0.01/π. The final results are given in Table 2, where Ψ-NN achieves the closest value to the ground truth. The evolution of this parameter during training is shown in Fig. 7. Since λ1 is included in the trainable parameter vector of the NN, these curves can be interpreted as optimization trajectories6; a shorter path indicates a clearer search direction and faster convergence. Thus, the structural features discovered by Ψ-NN reduce the output space and make it easier to find the correct solution.
a \({\nu }_{1}=\frac{0.01}{\pi }\). b \({\nu }_{2}=\frac{0.04}{\pi }\). c \({\nu }_{3}=\frac{0.08}{\pi }\). Burgers problem parameters comparison. The shorter the path, the clearer the search direction in the parameter space and the faster the convergence. The structural features discovered by Ψ-NN reduce the output space and make it easier to find the correct solution. Moreover, the network structure found in \({\nu }_{1}=\frac{0.01}{\pi }\) can successfully generalize to \({\nu }_{2}=\frac{0.04}{\pi }\) and \({\nu }_{3}=\frac{0.08}{\pi }\), yielding good results.
B. Ψ-NN structure performance in different parameter cases
The viscosity term ν to be solved in the inverse problem is represented by the unknown parameter λ1. This allows us to validate the applicability of the structure reconstructed for a specific problem under different parameters. We conducted experiments using the same Ψ-NN-reconstructed structure without modifying other configurations, specifically at ν2 = 0.04/π and ν3 = 0.08/π. The computational results are shown in Fig. 7, where the Ψ-NN method maintains shorter paths across different parameters. A shorter path across different parameters indicates that Ψ-NN has a more efficient optimization process in parameter space6. The final prediction results are summarized in Table 2, with values closest to the ground truth.
Poisson equation
Poisson’s equation plays a crucial role in various computations, including heat conduction, electromagnetism, and gravitational fields33. Here, we select a Poisson problem in a unit square domain with a smooth source term f(x1, x2) that contains four increasing frequencies. This choice allows us to demonstrate the applicability and performance of the Ψ-NN method across different parameters. High-frequency physical systems often exhibit inherent symmetric structures34. We therefore employ Ψ-NN to discover and leverage the symmetric patterns present in the problem, effectively alleviating the challenges associated with the high-frequency components and enhancing modeling efficiency and accuracy. Specifically, Poisson’s equation satisfies the constraint:
where the source term is given by:
The source term exhibits permutation equivariance, which can be formulated with the equivariant group \(H={{\mathbb{Z}}}_{2}\times {{\mathbb{Z}}}_{2}\) formed by the cyclic group of order 2, \({{\mathbb{Z}}}_{2}:\{0,1\}\). The two transformations can be stated as ∀ (h1, h2) ∈ H:
where \({{{{\mathcal{T}}}}}_{X}^{h},{{{{\mathcal{T}}}}}_{Y}^{h}\) are the group actions on domain X and codomain Y, respectively.
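For reference, for a mapping f : X → Y (here the source term, and later the solution), equivariance under H is the standard condition:

\[f\left({{{{\mathcal{T}}}}}_{X}^{h}{{{\boldsymbol{x}}}}\right)={{{{\mathcal{T}}}}}_{Y}^{h}\,f({{{\boldsymbol{x}}}}),\qquad \forall \,h\in H,\ {{{\boldsymbol{x}}}}\in X,\]

i.e., transforming the input and then applying f gives the same result as applying f and then transforming the output.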
Similarly, we construct a low-frequency solution to the PDEs (27) with the source term f = 0, ensuring the permutation property of the solution. The permutation property can be simplified using the permutation equivariant group \({H}_{e}={{\mathbb{Z}}}_{2}\) as ∀ h ∈ He:
The constructed solution is \(u={x}_{1}^{2}-{x}_{2}^{2}\). The low-rank parameter matrix extracted for this low-frequency function is:
Similarly, the relationship between the two-dimensional input x = (x1, x2) and the node outputs ua, ub of the final hidden layer can be obtained:
From the expressions of both, it follows that:
Since the expression of the final output u(x1, x2) is:
this result contains the symmetry defined in (30).
To match the network structure to the characteristics of the high-frequency solution, we set the values in b1 to share the same sign, which gives:
from which we have:
thus, the final expression:
which satisfies the symmetry of the high-frequency solution.
Moreover, the structure can be adjusted by sign to satisfy other forms of symmetry. For example, after (36), adjusting the sign of the weights in the last hidden layer can yield:
that is, the Poisson equation result can be adjusted by sign to satisfy the central symmetry ur(x1, x2) = ur( − x1, − x2).
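A small numerical sketch of this mechanism (illustrative weights only, not the extracted matrices): with zero biases in the first layer, the tanh oddness makes the first hidden layer odd in x; if the second hidden layer contains paired units whose incoming weights differ only in sign (and which share a bias), the paired activations swap under x → −x, so equal output weights per pair yield a centrally symmetric output.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hidden width, chosen for illustration

W1 = rng.normal(size=(d, 2))                 # first hidden layer, zero bias -> odd in x
V, b = rng.normal(size=(d // 2, d)), rng.normal(size=d // 2)
W2 = np.vstack([V, -V])                      # paired units with sign-reversed weights
b2 = np.tile(b, 2)                           # shared bias within each pair
a = np.tile(rng.normal(size=d // 2), 2)      # equal output weight within each pair

def u(x):
    h1 = np.tanh(W1 @ x)                     # h1(-x) = -h1(x)
    return a @ np.tanh(W2 @ h1 + b2)         # pairs swap under x -> -x

x = rng.normal(size=2)
print(np.allclose(u(x), u(-x)))              # True: u(x1, x2) = u(-x1, -x2)
```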
The numerical results are shown in Fig. 4c. The optimizer step size (learning rate) used in this problem is 1e−3, reduced by a factor of 0.2 at steps 5e3 and 1.5e4. The Ψ-NN-reconstructed NN outperforms both the PINN and PINN-post models in terms of speed and accuracy. Their L2 errors are summarized in Table 1. All three models exhibit peak errors along the line x2 = −x1 + 1, but Ψ-NN maintains a lower overall error, particularly at the boundaries. In contrast, the other models in the control group show large gradient errors at local boundaries, highlighting Ψ-NN’s superior performance in high-frequency fitting.
Steady flow passing a circular cylinder
Structured NNs reconstructed by Ψ-NN not only perform well on their original problems but also exhibit good applicability across different problems with similar characteristics. Here, we utilize the structures reconstructed from the Laplace problem and the Burgers equation to validate this applicability.
In the field of fluid mechanics, the two-dimensional incompressible laminar cylinder flow case35 serves as a complex case with multiple outputs and multiple constraints to test the performance of Ψ-NN under multi-output conditions. The selected outputs simultaneously exhibit two opposite symmetries, which better reflects the transfer ability of Ψ-NN structures. The governing equations are:
the inlet flow rate is:
that is:
The specific settings are shown in Fig. 8a. The results are shown in Fig. 8, with the loss iteration curve in Fig. 8b. The Ψ-NN method demonstrates superior performance in both convergence speed and accuracy, especially around the cylinder. The final L2 errors are summarized in Table 1.
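For reference, a standard form of the governing equations for steady, incompressible, laminar flow, which this case is assumed to follow (the case-specific nondimensionalization, inlet profile, and constants are given in ref. 35 and the Supplementary Information), is:

\[\nabla \cdot {{{\boldsymbol{u}}}}=0,\qquad \rho \,({{{\boldsymbol{u}}}}\cdot \nabla ){{{\boldsymbol{u}}}}=-\nabla p+\mu {\nabla }^{2}{{{\boldsymbol{u}}}},\]

where u = (u, v) is the velocity field, p the pressure, ρ the density, and μ the dynamic viscosity.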
a Cylinder flow setting35. The cylinder center Oc is located at (0.2, 0.2) m and the radius is 0.05 m. b Flow field loss comparison. c Flow field pressure results and error. d u (x-axis velocity) results and error. e v (y-axis velocity) results and error.
Discussion
This paper presents a novel physics structure-informed network extraction method, termed Ψ-NN. First, Ψ-NN employs a three-step distillation-structure extraction-reconstruction framework to automatically discover and extract network structures consistent with physical constraints from limited sampled data, thereby linking physical information in PDEs with network architecture. This approach overcomes the reliance on prior knowledge and manual design in traditional structured PINN construction, enabling automated structural embedding. Second, Ψ-NN integrates knowledge distillation and parameter regularization-based sparsification through a staged training strategy, introducing a new method for automatic structure extraction and reconstruction, and expanding the application of distillation and regularization in physics-based modeling. Unlike neural architecture search methods focused on hyperparameter optimization36, Ψ-NN emphasizes the automatic identification and representation of physical features among trainable parameters, surpassing the limitations of conventional structured sparsity and achieving structured embedding of physical information. The extracted network structures not only demonstrate strong physical relevance and applicability in numerical experiments, but are also supported by parameter convergence theorems and mathematical proofs, ensuring theoretical rigor and interpretability. Numerical results show that Ψ-NN achieves significantly improved fitting performance and reduced model complexity and computational cost compared to conventional PINNs and post-processed models (PINN-post).
Moreover, Ψ-NN offers a new perspective for network transfer learning. In the case of the Poisson equation, for example, by simplifying the original problem into low- and high-frequency components and further increasing complexity based on the extracted structure, Ψ-NN can efficiently regress from low-frequency to high-frequency solutions. During low-frequency simplification, the Ψ-NN structure remains interpretable, effectively reducing problem complexity and prioritizing computational efficiency. This structural transfer process provides an effective and flexible approach for extracting simple feature structures and performing complex regression across different problems, achieving low resource consumption and high computational accuracy.
Ψ-NN encodes the symbolic relationships of PDEs through network connectivity. While Ψ-NN effectively discovers interpretable structures from partially known problems, it still has some limitations that require further investigation:
1. The Ψ-NN method has been validated on physical problems with known forms of PDEs, and the extracted physical features demonstrate a certain degree of generalizability (for example, in the Poisson equation case, adjusting the sign of hidden layer parameters can achieve central symmetry or anti-symmetry). However, scenarios involving real observational data or genuinely uncertain or incomplete physical constraints may entail more complex parameter-feature relationships, such as time translation symmetry or rotational invariance at arbitrary angles. These properties are often associated with conservation laws via Noether’s theorem37, and may require more sophisticated network architectures or additional exploration in parameter space. We will investigate these potential applications in future research.
2. Since the calculations in Theorems 1 and 2 are based on the properties of MLPs, the current Distillation-Extraction-Reconstruction framework of the Ψ-NN method has been validated using a three-layer fully connected multilayer perceptron (MLP) architecture. The construction of structures involving sign transformations (such as symmetry transformations) relies on the odd symmetry of the tanh activation function, while for permutation transformations (e.g., in the Poisson equation case), this property is not required. However, for more expressive architectures in specialized domains, such as Transformers38 for sequential data, the multi-module and attention mechanisms make it difficult to establish a direct one-to-one correspondence between network parameters and physical features. Nevertheless, these models have the potential to be integrated with the distillation-extraction-reconstruction framework, which will require the development of new structure extraction and reconstruction methods tailored to the characteristics of each architecture. We will explore the integration of different network architectures with the Ψ-NN framework in future work to broaden its applicability.
Methods
The Ψ-NN method consists of three main components: distillation, structure extraction, and network reconstruction. The distillation process enables the transfer of physical information without additional intervention, decoupling the optimization of the physical and parameter directions by separating out the high-order derivative losses from the PDEs. The structure extraction method then automatically identifies low-rank parameter matrices with physical consistency while preserving physical information. Finally, the low-rank parameter matrices are reconstructed to form network structures with physical relevance.
Physics-informed distillation
Regularization, as an effective approach for parameter sparsification, cannot be efficiently applied in PINNs27,39. Moreover, parameter regularization introduces gradient optimization directions that may conflict with the existing physical constraint regularization in PINNs9, and these excessive constraints can actually degrade the accuracy of PINNs1. This makes it challenging to discover network structures related to physical constraints. To address this issue, it is necessary to appropriately decouple the processes of physical constraint enforcement and parameter sparsification. In classification tasks, distillation learning28 has proven to be a successful multi-model training strategy, enabling the student network to be trained without compromising the fitting accuracy of the teacher network. Therefore, we introduce a specialized distillation mechanism that allows learning bias and regularization to coexist.
Li et al.40 improved the distillation method for regression problems originally proposed by Saputra et al.41, and found that self-distillation can extract and utilize the rich physical information contained in datasets generated by PINNs. Inspired by this, we extend the distillation approach to a physics-informed distillation framework, enabling the separation of learning bias while transferring physical information across networks with different architectures.
Consider a PDE with temporal coordinate t and spatial coordinates \({{{\boldsymbol{x}}}}\in {{\mathbb{R}}}^{n}\), whose solution is denoted as \({{{\boldsymbol{u}}}}(t,{{{\boldsymbol{x}}}})\in {{\mathbb{R}}}^{k}\):
Teacher and student networks are defined as:
where θ is the vector of trainable parameters in the network, and the output u is the predicted solution to the PDE (48), with subscripts denoting teacher T and student S, respectively.
To achieve physics-informed supervision, the teacher model is trained following the PINN framework, with details provided in Supplementary Information.
Essentially, the student model is designed to replicate the outputs of the teacher model28. The distillation loss function is given by:
where MT denotes the number of collocation points in the computational field.
Consequently, in this staged training strategy (the teacher network first predicts the computational field, and the student network is then trained to learn the teacher network’s results), the teacher network bears the learning bias of the physical information containing high-order gradient terms (such as second- or higher-order derivatives in the PDEs), while the student network is allowed to shift toward the gradient direction of parameter regularization. To train and extract meaningful network structures from the student network, further parameter analysis techniques are required.
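A condensed sketch of the two loss assemblies in this staged strategy (illustrative PyTorch code; the network definitions, point sampling, and weighting coefficients are assumptions rather than the authors' exact setup):

```python
import torch

mse = torch.nn.MSELoss()

def teacher_loss(teacher, pde_residual, x_f, x_d, u_d):
    # Stage 1 (teacher = PINN): data/boundary loss plus the PDE residual loss,
    # which carries the high-order derivative terms.
    return mse(teacher(x_d), u_d) + pde_residual(teacher, x_f).pow(2).mean()

def student_loss(student, teacher, x_f, l2_weight=1e-4):
    # Stage 2 (student): only matches the frozen teacher's field prediction,
    # plus L2 parameter regularization that drives weights toward clusters.
    with torch.no_grad():
        u_teacher = teacher(x_f)
    reg = sum((p ** 2).sum() for p in student.parameters())
    return mse(student(x_f), u_teacher) + l2_weight * reg
```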
Structure extraction method
Regularization has been widely used as an effective parameter sparsification technique in network pruning and structural simplification methods42, effectively optimizing network structures by reducing complexity43. However, extracting physically meaningful and generalizable network structures–such as translation equivariance in convolutional networks or rotational invariance–requires a deeper understanding of parameter relationships26,27. To address this, Ψ-NN refines the student network’s parameter matrices through a specialized clustering approach.
Our structure extraction is based on L2 regularization (parameter smoothing), whose mathematical essence can be derived using the Lagrange multiplier method, promoting parameter convergence within the same layer, as shown in Supplementary Information. L2 regularization is applied only to the student network, not to the teacher network. First, consider L2 regularization on the student network parameters:

\[{{{{\mathcal{L}}}}}_{{{\rm{reg}}}}=\mathop{\sum }\limits_{l=1}^{L}{\omega }_{l}{\left\Vert {{{{\boldsymbol{\theta }}}}}_{l}\right\Vert }_{2}^{2},\]
where L is the number of layers in the network, ωl is the regularization weight, and θl contains the parameters of the nonlinear affine transformation in the l-th layer. Under L2 regularization, the parameter vector is stretched along its principal eigenvector direction, thereby enhancing major features while suppressing minor ones6.
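The Lagrange-multiplier view mentioned above can be summarized as follows: minimizing the training loss subject to a norm constraint on each layer's parameters has the same stationarity conditions as the penalized objective, with each ωl acting as the multiplier of its constraint (a standard correspondence stated here for reference; the matched pairs (cl, ωl) follow from the KKT conditions):

\[\mathop{\min }\limits_{{{{\boldsymbol{\theta }}}}}\ {{{\mathcal{L}}}}({{{\boldsymbol{\theta }}}})\quad {{\rm{s.t.}}}\quad {\left\Vert {{{{\boldsymbol{\theta }}}}}_{l}\right\Vert }_{2}^{2}\le {c}_{l}\ \ (l=1,\ldots,L)\qquad \Longleftrightarrow \qquad \mathop{\min }\limits_{{{{\boldsymbol{\theta }}}}}\ {{{\mathcal{L}}}}({{{\boldsymbol{\theta }}}})+\mathop{\sum }\limits_{l=1}^{L}{\omega }_{l}{\left\Vert {{{{\boldsymbol{\theta }}}}}_{l}\right\Vert }_{2}^{2}.\]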
For the evolution trend of parameters under L2 regularization, we have the following theorem:
Theorem 1
For n trainable parameters θ1, θ2, …, θn in the same hidden layer, if they play equivalent roles in the network, they will converge under L2 regularization.
Theorem 2
For n trainable parameter values ∣θ1∣, ∣θ2∣, …, ∣θn∣ in the same hidden layer, if parameter symmetry exists among them, they will converge under regularization.
The proofs are provided in Supplementary Information. Based on the effect of L2 regularization on parameter values (see Appendix D for details), to ensure that parameters representing similar correlations within each neuron share the same value and to further compress the NN, hierarchical agglomerative clustering (HAC)44 is performed on the absolute values of the weights in each layer. HAC does not require a preset number of clusters and can adaptively retain the necessary cluster centers; it also generates a hierarchical dendrogram, allowing for the selection of an appropriate clustering level through evaluation. We use Euclidean distance to measure the distribution of the data. Details of the clustering algorithm are provided in Supplementary Information.
For the weights θl in the l-th layer, we first compute their absolute values:

\[{{{{\boldsymbol{\theta }}}}}_{l}^{{{\rm{abs}}}}=\left\vert {{{{\boldsymbol{\theta }}}}}_{l}\right\vert ,\]
Next, we treat \({{{{\boldsymbol{\theta }}}}}_{l}^{{{{\rm{abs}}}}}\) as feature vectors and apply the HAC clustering algorithm. The clustering process is given by:
where K is the number of clusters and Ck denotes the k-th cluster. After clustering, the absolute values of the weights in each cluster Ck are replaced by the cluster center μk:
Finally, the updated weights \({{{{\boldsymbol{\theta }}}}}_{l}^{{{{\rm{new}}}}}\) are given by:
In this way, the n-dimensional trainable parameter vector θl in the l-th layer is reduced to a K-dimensional vector of cluster centers and a binary sign vector, achieving maximal structural refinement.
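A minimal sketch of this layer-wise step, assuming SciPy's agglomerative clustering with Ward linkage and the 0.1 distance threshold quoted in Fig. 5 (the linkage choice is an assumption):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def cluster_layer_weights(theta, dist_threshold=0.1):
    """Replace |weights| by their HAC cluster centers, keeping signs."""
    sign = np.sign(theta)
    theta_abs = np.abs(theta).reshape(-1, 1)           # 1-D features for HAC
    Z = linkage(theta_abs, method="ward", metric="euclidean")
    labels = fcluster(Z, t=dist_threshold, criterion="distance")
    centers = np.array([theta_abs[labels == k].mean() for k in np.unique(labels)])
    # Map every weight to its cluster center, then restore the original sign.
    theta_new = centers[labels - 1] * sign             # fcluster labels start at 1
    return theta_new, centers

theta = np.array([0.51, -0.49, 0.50, 1.02, -0.98])
print(cluster_layer_weights(theta))   # ~[0.5, -0.5, 0.5, 1.0, -1.0] and 2 centers
```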
After clustering and replacement, the network exhibits a new ordered matrix structure, which may still contain some parameter redundancy and reuse, requiring further analysis. Due to the nature of clustering, using only structured sparsity26 may overlook the relationships between parameters of adjacent hidden layers, thereby failing to identify physically relevant network structures encoded in these parameter relationships. Therefore, instead of pure low-rank decomposition, we adopt a hybrid compression strategy that combines low-rank constraints with structured parameter sharing. Essentially, this approach compresses and simplifies the weight matrix through parameter sharing and structured design, mapping the high-dimensional weight matrix to a low-dimensional structured subspace. Specifically, Ψ-NN not only reduces the rank of the parameter matrix, but also avoids overly complex network structures by identifying redundant reuse in the form of repeated submatrix basis vectors. The detailed analysis is presented in Section 3.
Different from structured pruning of nodes (i.e., complete weight-activation-weight network substructures), the main goal of Ψ-NN in the parameter matrix extraction process is to identify parameter relationships. To maximize the refinement of parameter relationships, Ψ-NN retains the sign features of parameters and refines the clustering granularity to the individual trainable parameters between each pair of nodes. On this basis, Ψ-NN can transcend traditional sparse strategies that merely merge repeated nodes, resulting in parameter matrices with physical relevance.
Network reconstruction
The core objective of the network reconstruction stage is to enhance the adaptability of the network while preserving physical relevance. By structurally reconstructing the extracted parameter relationship matrix, Ψ-NN yields an NN architecture that not only incorporates physical constraints but also adapts to new problems. Unlike approaches that rely solely on parameter pruning or zeroing, Ψ-NN reconstruction emphasizes not just parameter compression but, more importantly, the preservation and explicit representation of the physical relationships among parameters. As a result, the reconstructed network both reflects underlying physical laws and enables structural transferability across different problems.
Specifically, we first evaluate the clustered parameter matrix and sparsify redundant or insignificant parameters by zeroing them out, thereby improving parameter efficiency. For the salient parameter subsets (such as cluster centers), we perform reinitialization to restore their trainability. Meanwhile, the structural relationship matrix R is used to enforce consistency of numerical relationships and physical constraints among nodes in the reconstructed network. This approach ensures that the network retains sufficient degrees of freedom for learning while preserving the extracted physical structure.
The relation matrix R encodes the structural relationships among trainable parameters, with each row essentially derived from a transformed one-hot vector. Specifically: (1) If two sets of parameters are identical, the corresponding rows in R are the same, indicating parameter sharing; (2) If the parameters have equal magnitudes but opposite signs, the corresponding row elements in R are −1, representing a sign-reversal relationship; (3) If there is a permutation relationship between parameters, the relevant rows in R are swapped accordingly, reflecting parameter permutation. Thus, the elements of R are typically −1, 0, or 1, corresponding to inverse, unrelated, and direct relationships, respectively. In this way, R systematically expresses structural information such as parameter sharing, sign, and arrangement, and is used during the structure reconstruction stage to constrain the parameter representation of the new network, thereby achieving structured embedding of physical information. A concrete example of the formation and reconstruction process of the relation matrix is provided in the Laplace case study.
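A minimal sketch of how such a relation matrix can tie the trainable parameters of a reconstructed layer during the forward pass (illustrative; the names and shapes are ours, and the authors' implementation is in the released code46):

```python
import torch

class RelationTiedLinear(torch.nn.Module):
    """Linear layer whose weight matrix is generated from a small set of free
    parameters through a fixed relation matrix R with entries in {-1, 0, 1}."""

    def __init__(self, R, out_features, in_features):
        super().__init__()
        self.register_buffer("R", R)                              # (out*in, n_free), fixed
        self.free = torch.nn.Parameter(torch.randn(R.shape[1]))   # reinitialized centers
        self.shape = (out_features, in_features)

    def forward(self, x):
        W = (self.R @ self.free).view(self.shape)   # sharing / sign / permutation via R
        return x @ W.T

# Example: 4 weights tied to 2 free parameters (theta, -theta, phi, -phi).
R = torch.tensor([[1., 0.], [-1., 0.], [0., 1.], [0., -1.]])
layer = RelationTiedLinear(R, out_features=2, in_features=2)
print(layer(torch.randn(3, 2)).shape)   # torch.Size([3, 2])
```

Gradients flow only to the free parameters, so all tied entries remain consistent throughout retraining.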
This method preserves the iterative fitting capability of the new network and embeds the physics structure into the neural network via the relation matrix R, thus balancing physical consistency with model expressiveness. In this way, Ψ-NN’s network reconstruction not only achieves parameter compression and preservation of physical relationships but also significantly enhances the network’s applicability and interpretability through a structured adaptive fusion mechanism.
Whole implementation
Through the complete process of distillation, structure extraction, and network reconstruction, the Ψ-NN method ultimately achieves the intrinsic design of network structures with physical constraints. Pseudo-code is provided in Supplementary Information to illustrate the entire implementation process. The key steps are summarized as follows:
1. Distillation: In the distillation stage, the choice of teacher network is not unique and depends on the characteristics of the problem. When sufficient understanding of the physical problem exists and a state-of-the-art (SOTA) model is available, selecting the SOTA model is preferable, as its outputs more accurately reflect the true physical scenario. This more precise reconstruction incorporates richer physical information into the generated data, facilitating the extraction of structures that are more physically relevant for the student model. However, when the understanding of the physical problem is limited or an SOTA model is not applicable, more general networks such as PINN or PI-CNN (Physics-informed Convolutional Neural Network45) can be used as the teacher network. In this paper, PINNs are used as the teacher network.
2. Structure extraction: The extraction method not only preserves the learned physical structure but also greatly simplifies the network architecture. During extraction, due to the convergence of structural parameters, the HAC clustering algorithm compresses parameter vectors into smaller cluster center vectors, thereby transforming physical features into network parameter matrices and ensuring the physical relevance of the network framework.
3. Network reconstruction: In the reconstruction process, Ψ-NN maximally retains the original parameter relationships while only reinitializing the trainable parameters, resulting in a final network structure that incorporates physical relevance. This approach allows the Ψ-NN structure to fully utilize trainable parameters, enhancing parameter efficiency and enabling applicability across a broader problem space.
Essentially, the Ψ-NN method can be regarded as a specialized form of regularization–integrating physical constraints directly into the internal network structure to produce problem-specific architectures. Ψ-NN offers several advantages: (A) intrinsic structural features–by constraining the structure of the parameter matrix, the model is guided to learn patterns consistent with specific physical laws, ensuring that outputs naturally satisfy physical constraints; (B) interpretability–the combination of submatrices reveals the underlying composition of input features, providing mathematical consistency; and (C) parameter efficiency–parameter sharing reduces model complexity and improves parameter utilization.
Data availability
The sample data used in this study are available in the GitHub repository at https://github.com/ZitiLiu/Psi-NN.
Code availability
All code accompanying this manuscript is publicly available46.
References
Karniadakis, G. E., Kevrekidis, I. G., Lu, L., Perdikaris, P., Wang, S. & Yang, L. Physics-informed machine learning. Nat. Rev. Phys. 3, 422–440 (2021).
Zienkiewicz, O.C., Taylor, R.L. The Finite Element Method: Its Basis and Fundamentals. Butterworth-Heinemann, Oxford (2000).
Belytschko, T., Krongauz, Y., Organ, D., Fleming, M. & Krysl, P. Meshless methods: An overview and recent developments. Comput. Methods Appl. Mech. Eng. 139, 3–47 (1996).
Moin, P. & Mahesh, K. Direct numerical simulation: A tool in turbulence research. Annu. Rev. Fluid Mech. 30, 539–578 (1998).
Lookman, T., Balachandran, P. V., Xue, D. & Yuan, R. Active learning in materials science with emphasis on adaptive sampling using uncertainties for targeted design. npj Comput. Mater. 5, 21 (2019).
Goodfellow, I., Bengio, Y., Courville, A. Deep Learning. MIT Press, Cambridge, MA (2016).
Schmidt, M. & Lipson, H. Distilling free-form natural laws from experimental data. Science 324, 81–85 (2009).
Brunton, S. L., Proctor, J. L. & Kutz, J. N. Discovering governing equations from data: Sparse identification of nonlinear dynamical systems. Proc. Natl Acad. Sci. 113, 3932–3937 (2016).
Raissi, M., Perdikaris, P. & Karniadakis, G. E. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 378, 686–707 (2019).
Meng, X., Li, Z., Zhang, D. & Karniadakis, G. E. PPINN: Parareal physics-informed neural network for time-dependent PDEs. Comput. Methods Appl. Mech. Eng. 370, 113250 (2020).
Zhu, W., Khademi, W., Charalampidis, E. G. & Kevrekidis, P. G. Neural networks enforcing physical symmetries in nonlinear dynamical lattices: The case example of the Ablowitz-Ladik model. Phys. D: Nonlinear Phenom. 434, 133264 (2022).
Raissi, M., Yazdani, A. & Karniadakis, G. E. Hidden fluid mechanics: Learning velocity and pressure fields from flow visualizations. Science 367, 1026–1030 (2020).
Chen, Y., Huang, D., Zhang, D., Zeng, J., Wang, N., Zhang, H. & Yan, J. Theory-guided hard constraint projection (HCP): A knowledge-based data-driven scientific machine learning method. J. Comput. Phys. 445, 110624 (2021).
Shin, Y., Zhang, Z. & Karniadakis, G. E. Error estimates of residual minimization using neural networks for linear PDEs. J. Mach. Learn. Model. Comput. 4(4) (2023).
Zhang, Z.-Y., Zhang, H., Zhang, L.-S. & Guo, L.-L. Enforcing continuous symmetries in physics-informed neural network for solving forward and inverse problems of partial differential equations. J. Comput. Phys. 492, 112415 (2023).
Wang, N., Zhang, D., Chang, H. & Li, H. Deep learning of subsurface flow via theory-guided neural network. J. Hydrol. 584, 124700 (2020).
Lin, S. & Chen, Y. A two-stage physics-informed neural network method based on conserved quantities and applications in localized wave solutions. J. Comput. Phys. 457, 111053 (2022).
Greydanus, S., Dzamba, M., Yosinski, J. Hamiltonian neural networks. Advances in neural information processing systems 32 (2019).
Cranmer, M., Greydanus, S., Hoyer, S., Battaglia, P., Spergel, D. & Ho, S. Lagrangian neural networks. ICLR 2020 Workshop on Integration of Deep Neural Models and Differential Equations (2019).
Chen, Z., Zhang, J., Arjovsky, M. & Bottou, L. Symplectic recurrent neural networks. ICLR 2020 (2019).
Liu, Z. & Tegmark, M. Machine-learning hidden symmetries. Phys. Rev. Lett. 128, 180201 (2022).
Hendriks, J., Jidling, C., Wills, A. & Schön, T. Linearly constrained neural networks. Preprint at arXiv:2002.01600 (2020).
Rao, C., Ren, P., Wang, Q., Buyukozturk, O., Sun, H. & Liu, Y. Encoding physics to learn reaction-diffusion processes. Nat. Mach. Intell. 5, 765–779 (2023).
Michel, A. N., Farrell, J. A. & Porod, W. Qualitative analysis of neural networks. IEEE Trans. Circuits Syst. 36, 229–243 (1989).
Hinton, G. E. & Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science 313, 504–507 (2006).
Zhang, D., Wang, H., Figueiredo, M., Balzano, L. Learning to share: Simultaneous parameter tying and sparsification in deep learning. In: International Conference on Learning Representations (2018).
Fuhg, J. N., Jones, R. E. & Bouklas, N. Extreme sparsification of physics-augmented neural networks for interpretable model discovery in mechanics. Comput. Methods Appl. Mech. Eng. 426, 116973 (2024).
Hinton, G., Vinyals, O., Dean, J. Distilling the Knowledge in a Neural Network. arXiv. arXiv:1503.02531 (2015).
Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. Preprint at arXiv:1412.6980 (2014).
Shortley, G. H. & Weller, R. The numerical solution of Laplace’s Equation. J. Appl. Phys. 9, 334–348 (2004).
Liu, Z., Liu, Y., Yan, X., Liu, W., Guo, S. & Zhang, C.-a AsPINN: Adaptive symmetry-recomposition physics-informed neural networks. Comput. Methods Appl. Mech. Eng. 432, 117405 (2024).
Hon, Y. C. & Mao, X. Z. An efficient numerical scheme for Burgers’ equation. Appl. Math. Comput. 95, 37–50 (1998).
Jackson, J.D. Classical Electrodynamics, 3rd edn. Wiley, New York (1999).
Xu, Z.-Q. J., Zhang, Y., Luo, T., Xiao, Y. & Ma, Z. Frequency Principle: Fourier Analysis Sheds Light on Deep Neural Networks. Commun. Comput. Phys. 28, 1746–1767 (2020).
Rao, C., Sun, H. & Liu, Y. Physics-informed deep learning for incompressible laminar flows. Theor. Appl. Mech. Lett. 10, 207–212 (2020).
Elsken, T., Metzen, J. H. & Hutter, F. Neural architecture search: A survey. J. Mach. Learn. Res. 20, 1–21 (2019).
Noether, E. Invariant variation problems. Transp. theory Stat. Phys. 1, 186–207 (1971).
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I. Attention is all you need. Advances in neural information processing systems 30, (2017).
Xu, C., Lu, C., Liang, X., Gao, J., Zheng, W., Wang, T. & Yan, S. Multi-loss regularized deep neural network. IEEE Trans. Circuits Syst. Video Technol. 26, 2273–2283 (2016).
Li, Y., Yang, J., Wang, D. Self-knowledge distillation enhanced universal framework for physics-informed neural networks. Nonlinear Dyn. (2025).
Saputra, M. R. U., Gusmao, P. P. B. d., Almalioglu, Y., Markham, A. & Trigoni, N. Distilling knowledge from a deep pose regressor network. In Proc. IEEE/CVF International Conference on Computer Vision 263–272 (2019).
Wen, W., Wu, C., Wang, Y., Chen, Y., Li, H. Learning structured sparsity in deep neural networks. Adv. Neural Inf. Process. Syst. 29, (2016).
LeCun, Y., Denker, J., Solla, S. Optimal brain damage. Adv. Neural Inf. Process. Syst. 2, (1989).
Ward Jr, J. H. Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc. 58, 236–244 (1963).
Zhu, Y., Zabaras, N., Koutsourelakis, P.-S. & Perdikaris, P. Physics-constrained deep learning for high-dimensional surrogate modeling and uncertainty quantification without labeled data. J. Comput. Phys. 394, 56–81 (2019).
Liu, Z., Liu, Y., Yan, X., Liu, W., Nie, H., Guo, S., Zhang, C.-A. Automatic network structure discovery of physics informed neural networks via knowledge distillation. Psi-NN code repository, https://doi.org/10.5281/zenodo.17098398 (2025).
Acknowledgements
This work is supported by the Chinese Academy of Sciences Project for Young Scientists in Basic Research (No.YSBR-107), the Strategic Priority Research Program of Chinese Academy of Science (No.XDB0620402), and the Youth Innovation Promotion Association Chinese Academy of Science (No.2023023). The authors would also like to acknowledge Dr. Yang Xuan for his creation and design of the Featured Image.
Author information
Contributions
Ziti Liu: Conceptualization, Data curation, Investigation, Methodology, Software, Writing - original draft. Yang Liu: Data curation, Writing - original draft. Xunshi Yan: Software, Formal analysis, Visualization, Supervision. Wen Liu: Investigation, Supervision. Han Nie: Supervision, Validation. Shuaiqi Guo: Visualization, Validation. Chen-an Zhang: Funding acquisition, Investigation, Supervision.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Kiran Bacsa and the other anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Liu, Z., Liu, Y., Yan, X. et al. Automatic network structure discovery of physics informed neural networks via knowledge distillation. Nat Commun 16, 9558 (2025). https://doi.org/10.1038/s41467-025-64624-3