Introduction

Large language models (LLMs) have rapidly transformed natural language processing (NLP) through the adoption of the transformer architecture, first introduced in the seminal work “Attention is All You Need”1. Unlike earlier architectures such as RNNs and LSTMs, transformers utilize multi-head self-attention to effectively model long-range dependencies while enabling highly parallelized training. This innovation facilitated the development of a new generation of large-scale models, exemplified by BERT2 and GPT3, which achieved unprecedented performance across diverse NLP tasks. As transformer-based models scaled in size, they began to exhibit emergent abilities, signifying a paradigm shift from task-specific systems toward general-purpose artificial intelligence.

The revolutionary parallelization capabilities inherent in the Transformer architecture have propelled LLMs into an era of unprecedented scale, yet concurrently ushered in unsustainable resource demands. Initial models like BERT were trainable within days on commodity GPUs; in stark contrast, training contemporary trillion-parameter models necessitates exascale computation. For instance, training PaLM-540B consumed over 8.4 million TPU hours4, and the carbon footprint of GPT-3 reached 552 metric tons of CO2 equivalent5. Empirically underpinned by “Scaling Laws”6, this exponential growth in parameters, data, and computational resources has yielded significant performance gains, but at the substantial cost of soaring computational, memory, and environmental overheads. Crucially, the challenges extend beyond training: fewer than 4% of NLP research studies deploy full-scale LLMs in real-world experiments7, underscoring a growing divide between frontier model development and practical accessibility. Thus, enabling plug-and-play deployment of high-performance models has become a vital objective for practical applications8.

To mitigate the soaring computational costs and facilitate real-world deployment, a broad range of model compression techniques has been developed. These methods aim to reduce the memory footprint, inference latency, and energy consumption of LLMs without incurring prohibitive accuracy degradation. Among them, structured pruning eliminates entire components such as attention heads, feedforward blocks, or layers based on their relative importance, yielding a hardware-friendly sparsity pattern. In contrast, unstructured pruning operates at a finer granularity, removing individual weights or connections, often resulting in higher compression ratios but less predictable hardware acceleration. Quantization techniques reduce the precision of model parameters and activations, replacing standard 16-bit or 32-bit floating-point representations with low-bit formats, thereby achieving dramatic reductions in storage and computational requirements. A parallel line of work explores low-rank decomposition9, which approximates weight matrices using the product of smaller-rank tensors, preserving essential information while reducing parameter count and matrix multiplication complexity. While these compression methods significantly alleviate resource demands, they frequently compromise model reliability. Compressed models can suffer from unstable performance, reduced generalization capacity, and, critically, abrupt capability loss when compression exceeds specific thresholds—a phenomenon termed the “Phase Transition Point” (PTP), which underscores the non-linear risks inherent in aggressive compression. Capturing and understanding this dynamic behavior is therefore essential for advancing compression strategies beyond trial-and-error heuristics.

This Perspective introduces the concept of “Model Phase Transition” to fundamentally characterize performance degradation and near-lossless compression limits in LLMs. The “Overview of Compression Techniques” section characterizes model redundancy mechanisms and establishes their theoretical orthogonality. The “When Compression Becomes Catastrophic” section quantitatively models performance trajectories to pinpoint critical Phase Transition Points across individual and combined methods. The “Criticality-Aware Compression Framework” section proposes a transformation of compression into a multi-dimensional trajectory planning problem guided by Phase Avoidance. The “Validation and Perspectives” section validates this strategy through comparative experiments, demonstrating that compressed large models outperform native small ones, and offers perspectives on efficient AI. Finally, the “Conclusions and Outlook” section summarizes fundamental limits and outlines future research trajectories. Detailed analyses, including mathematical proofs of orthogonality, robustness assessments, low-rank decomposition transitions, combined compression strategies, and benchmarks of ~40 methods, are provided in the Supplementary Information. Related papers and supporting materials will be regularly updated at https://github.com/whucs21Mzy/Model-Phase-Transitions.

Overview of compression techniques

The drive toward efficient LLMs has spawned a spectrum of compression techniques, each navigating distinct trade-offs between computational frugality and functional preservation. As we later reveal, all methods converge toward a universal phase transition boundary where aggressive compression triggers catastrophic collapse. Here, we dissect four dominant paradigms: structured pruning (targeting hardware-friendly substructures), unstructured pruning (maximizing fine-grained sparsity), quantization (reducing numerical precision), and low-rank decomposition (factorizing weight matrices).

Redundancy as the foundation of phase transitions

The existence of model phase transitions fundamentally stems from three complementary forms of redundancy inherent in large-scale neural architectures. These redundancy mechanisms collectively create buffers against compression damage but exhibit critical exhaustion thresholds that trigger phase transitions. Furthermore, we provide a detailed mathematical proof regarding the orthogonality of these redundancy types in the Supplementary Information “Orthogonality of Compression Mechanisms”, justifying their independent analysis.

Structural redundancy

Structural redundancy arises from architectural properties enabling functional preservation under component removal. The Lottery Ticket Hypothesis reveals that dense networks contain efficient subnetworks capable of maintaining full functionality10,11, allowing gradual pruning without immediate collapse. Modern Transformers amplify this through residual connections, where the skip operation \(x^{(\ell+1)}=x^{(\ell)}+f(x^{(\ell)},\theta^{(\ell)})\) mathematically guarantees output stability (\(x^{(\ell)}\approx x^{(\ell-1)}+\epsilon\)). This permits substantial layer removal with minimal functional degradation12. Crucially, dynamic compensation mechanisms allow downstream components to redistribute functionality when upstream elements are compromised, extending the buffer zone before phase transition.
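The stabilizing effect of the skip path can be illustrated with a toy residual stack (a minimal numpy sketch, not an actual Transformer; the layer count, width, and weight scale below are arbitrary illustrative choices): because each block is a small correction on top of the identity path, removing one block perturbs the final output only slightly.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_layers = 64, 8

# Toy residual stack: x_{l+1} = x_l + f_l(x_l), with small-norm blocks f_l
# mimicking the near-identity behaviour of trained residual layers.
weights = [0.1 * rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_layers)]

def forward(x, skip_layer=None):
    for l, W in enumerate(weights):
        if l == skip_layer:
            continue  # structured pruning: drop this residual block entirely
        x = x + np.tanh(x @ W)
    return x

x0 = rng.standard_normal(d)
full = forward(x0)
pruned = forward(x0, skip_layer=4)

# Relative output change stays small: the identity path carries most of the signal.
rel_err = np.linalg.norm(full - pruned) / np.linalg.norm(full)
print(f"relative output change after dropping one layer: {rel_err:.3f}")
```

Without the skip connection, removing a layer would replace its entire transformation rather than a small residual correction, and the output would change drastically.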

Numerical redundancy

Numerical redundancy arises from the extreme imbalance in weight or activation distributions. The vast majority of values concentrate within a narrow range, while a minority of outliers exert disproportionate influence on outputs. This heavy-tailed distribution enables compression of 99% of values with negligible impact. Critically, quantization error propagates non-uniformly:

Consider the full-precision operation and its quantized counterpart:

$$y=Wx,\qquad \widehat{y}=Q(W)\,x.$$
(1)

The resulting quantization error decomposes into two distinct components:

$${\left\Vert y-\hat{y}\right\Vert }_{2}^{2}=\underbrace{\sum _{(i,j)\in \text{normal}}{\left[\text{Err}({w}_{ij})\right]}^{2}}_{\text{negligible}}+\underbrace{\sum _{(i,j)\in \text{outliers}}{\left[\text{Err}({w}_{ij})\right]}^{2}}_{\text{dominant}}$$
(2)

This dominance is intrinsic to the heavy-tailed distribution of LLM parameters13. Standard uniform quantization faces a dilemma: accommodating the wide dynamic range of outliers forces a large quantization step size, increasing error for the dense “normal” region; conversely, narrowing the range to fit normal values clips outliers, causing massive individual errors14,15. Since these outliers often encode critical emergent features, their distortion dominates the total error norm.
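The dilemma can be made concrete with a small numpy sketch (synthetic heavy-tailed weights and a generic 4-bit uniform quantizer, not any published scheme): when the quantization range is fitted to the dense bulk, the clipped outliers contribute almost all of the squared error, mirroring the decomposition in Eq. (2).

```python
import numpy as np

rng = np.random.default_rng(0)

# Heavy-tailed weight vector: a dense bulk of small values plus ~0.5% outliers
# of roughly 100x larger magnitude.
w = 0.02 * rng.standard_normal(10_000)
outliers = rng.choice(w.size, size=50, replace=False)
w[outliers] *= 100.0

def quantize(x, bits, clip):
    # Symmetric uniform quantizer whose range is fitted to `clip`;
    # values beyond the range saturate (are clipped).
    qmax = 2 ** (bits - 1) - 1
    scale = clip / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

# Range fitted to the dense "normal" region, so outliers are clipped:
clip = np.percentile(np.abs(w), 99.0)
sq_err = (w - quantize(w, bits=4, clip=clip)) ** 2

mask = np.zeros(w.size, dtype=bool)
mask[outliers] = True
normal_err, outlier_err = sq_err[~mask].sum(), sq_err[mask].sum()
print(f"normal term: {normal_err:.4f}, outlier term: {outlier_err:.1f}")
```

Widening the range to cover the outliers instead reverses the trade-off: the outlier term shrinks, but the step size for the dense bulk grows by two orders of magnitude, which is precisely the dilemma described above.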

Therefore, state-of-the-art strategies prioritize preserving outlier precision. Observing that outliers concentrate in specific channels14,15, methods like AWQ16 perform activation-aware scaling to protect salient weights. SmoothQuant17 mathematically migrates the difficulty of quantization from activations to weights. GPTQ18 further employs second-order Hessian information to iteratively compensate for errors induced by quantizing these critical parameters. These methods collectively validate that effectively managing outlier error is key to extending the compression phase.

Algebraic redundancy

Algebraic redundancy refers to the inherent low-rank property within weight matrices, where model weights and activations, despite being high-dimensional matrices, can be approximated by lower-rank representations. A matrix \(W\in {{\mathbb{R}}}^{m\times n}\) decomposes as

$$W=U\Sigma {V}^{\top }.$$
(3)

This redundancy arises from two primary sources: (1) Linear correlations between neurons, manifested as significant coherence among columns (neurons) of the weight matrix, enabling representation via a minimal set of basis vectors, and (2) The stronger low-rank characteristic of LLM activations compared to weights19. Crucially, the singular values of LLM weight matrices exhibit rapid decay beyond the top-k values, indicating that most energy concentrates in a low-rank subspace. Smaller singular values contribute minimally to the matrix and can thus be truncated, yielding the approximation

$${W}_{k}={U}_{k}{\Sigma }_{k}{V}_{k}^{\top }.$$
(4)
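The truncation of Eq. (4) can be sketched in a few lines of numpy (the matrix below is synthetic, with an artificially planted low-rank structure rather than real LLM weights):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 256, 256, 16

# Synthetic weight matrix: a rank-16 signal plus small noise, mimicking the
# rapidly decaying singular spectra described in the text.
W = rng.standard_normal((m, r)) @ rng.standard_normal((r, n)) \
    + 0.01 * rng.standard_normal((m, n))

# numpy returns singular values in descending order, so truncation is a slice.
U, S, Vt = np.linalg.svd(W, full_matrices=False)

k = 16
W_k = (U[:, :k] * S[:k]) @ Vt[:k]  # W_k = U_k Sigma_k V_k^T

rel_err = np.linalg.norm(W - W_k) / np.linalg.norm(W)
params_full, params_lowrank = m * n, k * (m + n + 1)  # store U_k, Sigma_k, V_k
print(f"rank-{k} relative error: {rel_err:.4f}, "
      f"parameter ratio: {params_lowrank / params_full:.2f}")
```

Storing the factors \(U_k\), \(\Sigma_k\), and \(V_k\) instead of \(W\) reduces both the parameter count and the cost of the matrix-vector product, which can be computed as \(U_k(\Sigma_k(V_k^\top x))\).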

These redundancy buffers saturate nonlinearly upon reaching critical compression thresholds (PTPs). Structural compensation capacity exhausts first due to component removal, followed by numerical or approximation errors overwhelming outlier preservation and low-rank truncation. Larger models exhibit delayed PTPs due to expanded redundancy buffers, extending the safe compression zone before catastrophic collapse (Fig. 1).

Fig. 1: Model phase transitions and redundancy in model compression.

This figure highlights three main types of redundancy: structural, numerical, and algebraic redundancy. Structural redundancy is managed through pruning, numerical redundancy through quantization, and algebraic redundancy through low-rank decomposition. These redundancies act as buffers, allowing for lossless model compression until the phase transition point is reached. The phase transition point remains stable when different types of compression methods are used together, enabling lossless compression of large models to about 10% of their original size.

Pruning-induced model compression

Structured pruning

Structured pruning removes neurons, attention heads, channels, sub-layers, or entire layers according to specific rules, or zeroes out weights in fixed blocks (semi-structured pruning). Because it retains the overall network structure, it is more conducive to hardware acceleration. As noted in previous work20, structured pruning strategies can be categorized into three types based on pruning criteria and optimization objectives: size-based pruning, regularization-based pruning, and loss-based pruning.

Size-based Pruning removes less important components by measuring the importance of weights, activations, or redundancy with the goal of directly reducing the model size while maintaining performance. Methods like FLAP21 and ShortGPT22 fall under this category. Regularization-based Pruning introduces regularization terms (e.g., L1 regularization or angular distance regularization) into the objective function to constrain the weight distribution, inducing sparsity and selectively removing unimportant components. Examples include Sheared LLaMA23 and SRAD24. Loss-based Pruning quantifies the sensitivity of weights to the loss function to assess the impact of pruning on the overall model performance, prioritizing the removal of components that have minimal effects on the loss. This approach is exemplified by methods like LLM-Pruner25 and SLEB26.
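As a minimal illustration of the size-based family (a toy criterion only, far simpler than FLAP or ShortGPT), one can rank attention heads by their weight norm and remove the weakest heads wholesale, leaving a smaller but still dense tensor:

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, d_head, d_model = 12, 64, 768

# Per-head slices of an attention output projection. The synthetic heads are
# given different magnitudes so that a size-based criterion has something
# meaningful to rank.
scales = np.linspace(0.2, 1.0, n_heads)
W_o = scales[:, None, None] * rng.standard_normal((n_heads, d_head, d_model))

# Size-based importance: the L2 norm of each head's flattened weights.
importance = np.linalg.norm(W_o.reshape(n_heads, -1), axis=1)
keep = np.sort(np.argsort(importance)[-8:])  # retain the top-8 heads

# Whole heads are removed, so the pruned tensor stays dense and
# hardware-friendly (no irregular sparsity pattern).
W_o_pruned = W_o[keep]
print(W_o_pruned.shape, keep)
```

Real methods replace the norm-based score with activation-aware or redundancy-aware importance measures, but the removal pattern (whole components at once) is the same.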

These three pruning strategies offer unique advantages and collectively support the goal of enhancing efficiency and robustness in large-scale models. Table 1 summarizes some structured pruning methods.

Table 1 Summary of structured pruning methods, formulas, and categories

Unstructured pruning

Unstructured pruning is an optimization technique that achieves model sparsity by evaluating the importance of individual weights. Its flexibility and high compression rates make it a key method for optimizing LLMs. Unstructured pruning can achieve extremely high compression rates; for instance, Wanda achieves a 60% sparsity rate on LLaMA-7B with minimal performance degradation across multiple downstream tasks27, while Flash-LLM achieves a 70% sparsity rate on OPT-175B, significantly reducing storage requirements with <2% performance degradation during inference28. However, unstructured pruning often results in irregular sparse patterns in the weight matrix, necessitating specialized hardware accelerators (sparse matrix multiplication units) to efficiently handle sparse matrix computations and fully exploit the benefits of sparsity in terms of storage and computation.

Among various unstructured pruning methods, Magnitude Pruning is the most basic, directly removing weights with small magnitudes. While simple to implement, it does not account for the contextual importance of weights. SparseGPT29, on the other hand, introduces a diagonal Hessian approximation to assess the impact of weights on errors, enabling more precise pruning at the cost of high computational complexity and hardware resource requirements. Wanda27 simplifies the SparseGPT algorithm by eliminating the need for Hessian approximations and instead computing pruning metrics by multiplying weights with input activations. This simplification significantly reduces computational complexity while achieving a balance between high accuracy and efficiency. Following this approach, many subsequent methods use SparseGPT and Wanda as baselines or build upon their foundations. RIA30 introduces a post-training pruning method that re-evaluates the importance of each weight element based on all input and output connections. ADMM31 builds on SparseGPT by incorporating the Alternating Direction Method of Multipliers (ADMM) to restore model performance after pruning, using a simple iterative mask selection process for pruning. OWL32 integrates both Wanda and SparseGPT, proposing the OWL metric to allocate varying pruning rates across different layers. Similarly, BESA33 refines pruning by considering each transformer block’s pruning error and allocating sparsity in a differentiable way, overcoming the perturbations associated with traditional layer-wise approaches. DsnoT34 is also an extension of the SparseGPT and Wanda pruning strategies, introducing a training-free fine-tuning approach that iteratively refines sparse LLMs by adjusting sparse masks, minimizing the reconstruction error between sparse and dense models. Several pruning methods have been developed independently of Wanda and SparseGPT. 
For example, Flash-LLM28 introduces a “Load-as-Sparse, Compute-as-Dense” strategy, which optimizes memory bandwidth while allowing tensor cores to perform computations as if the model were dense. LoRAPrune35 incorporates LoRA (Low-Rank Adaptation) modules to evaluate the importance of weights and activations, excelling in task-specific pruning scenarios, albeit at the expense of additional computational overhead due to the extra modules. Table 2 summarizes the specific details of these methods.
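The contrast between plain magnitude pruning and activation-aware scoring can be sketched as follows (a simplified rendition of Wanda's \(|w_{ij}|\cdot \Vert x_j\Vert_2\) metric on synthetic data with planted outlier channels; the published method operates on calibration batches inside each Transformer layer):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, n_tokens = 64, 128, 512

W = 0.02 * rng.standard_normal((d_out, d_in))
# Calibration activations in which a few input channels are much larger,
# mirroring the outlier channels observed in LLMs.
channel_scale = np.where(rng.random(d_in) < 0.05, 6.0, 1.0)
X = channel_scale * rng.standard_normal((n_tokens, d_in))

# Wanda-style importance: |w_ij| * ||x_j||_2, with the activation norm taken
# over the calibration tokens of input channel j.
score = np.abs(W) * np.linalg.norm(X, axis=0)

def prune(metric, sparsity=0.5):
    # Per-output-row comparison group: keep the highest-scoring weights per row.
    k = int(metric.shape[1] * (1 - sparsity))
    mask = np.zeros_like(metric, dtype=bool)
    np.put_along_axis(mask, np.argsort(metric, axis=1)[:, -k:], True, axis=1)
    return W * mask

# Output reconstruction error at 50% sparsity under the two criteria.
err_wanda = np.linalg.norm(X @ prune(score).T - X @ W.T)
err_magnitude = np.linalg.norm(X @ prune(np.abs(W)).T - X @ W.T)
print(f"output error, Wanda-style: {err_wanda:.2f} vs magnitude: {err_magnitude:.2f}")
```

The activation-aware score protects weights attached to high-magnitude channels, which a pure magnitude criterion discards as readily as any other small weight.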

Table 2 Comparison of pruning algorithms for unstructured pruning in LLMs

Quantization and precision-driven compression

Quantization aims to reduce the precision of model parameters, thereby decreasing storage and computational complexity, significantly improving inference efficiency and hardware compatibility. Specifically, quantization converts floating-point values (e.g., FP32, BF16) into low-precision fixed-point, integer, or low-bit floating-point formats (e.g., INT8, FP4), effectively reducing the computational load and memory consumption during inference. Studies have shown that classical models such as AlexNet and ResNet, when quantized to INT8, can still achieve classification accuracy close to floating-point precision on the ImageNet dataset, demonstrating the effectiveness of quantization36.

Quantization fundamentals

Weight Quantization and Activation Quantization

Weight Quantization and Activation Quantization are two fundamental directions in quantization. Weight quantization converts neural network weights from high-precision floating-point numbers to lower-precision integers, reducing storage requirements and significantly lowering inference power consumption. Activation quantization further reduces memory usage and bandwidth requirements by quantizing intermediate activation values. The distribution of weights and activations plays a critical role in determining quantization precision. For instance, many neural networks exhibit normally distributed or sparse weights, enabling effective performance retention even after clipping outliers or redistributing value ranges37.

Symmetric and Asymmetric Quantization

In symmetric quantization, the quantization intervals for weights and activations are symmetric around zero, while asymmetric quantization allows non-symmetric intervals, which are more effective for complex data distributions. For example, the LSQ (Learned Step Size Quantization) method dynamically learns the quantization step size and adjusts strategies based on the actual distribution of weights and activations, thereby improving the adaptability of low-precision quantization38.
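The two schemes can be compared on skewed data with a short numpy sketch (generic 4-bit min/max quantizers for illustration, not LSQ, whose step sizes are learned during training):

```python
import numpy as np

rng = np.random.default_rng(0)

def quant_symmetric(x, bits=8):
    # Zero-centred grid: one scale, range [-max|x|, +max|x|].
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def quant_asymmetric(x, bits=8):
    # Affine grid with a zero-point: range [min(x), max(x)].
    qmax = 2 ** bits - 1
    scale = (x.max() - x.min()) / qmax
    zero_point = np.round(-x.min() / scale)
    q = np.clip(np.round(x / scale + zero_point), 0, qmax)
    return (q - zero_point) * scale

# Skewed activations (e.g. post-ReLU): mostly zero or positive.
x = np.maximum(rng.standard_normal(10_000), 0.0)

err_sym = np.mean((x - quant_symmetric(x, bits=4)) ** 2)
err_asym = np.mean((x - quant_asymmetric(x, bits=4)) ** 2)
print(f"MSE symmetric: {err_sym:.6f}, asymmetric: {err_asym:.6f}")
```

On this one-sided distribution the symmetric grid wastes half of its levels on negative values that never occur, so the asymmetric quantizer achieves a markedly lower error at the same bit-width.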

Precision restoration

Quantization-Aware Training

Quantization-Aware Training (QAT) is an optimization strategy that introduces simulated quantization noise during training to adapt models to quantization errors. Studies have shown that introducing quantization noise can act as a form of regularization, akin to data augmentation or Dropout, thereby enhancing model robustness39. For instance, simulating quantization errors during training significantly improves a model’s adaptability to low-precision computations in inference40. Additionally, HAQ (Hardware-Aware Automated Quantization) uses reinforcement learning to automatically determine the optimal quantization bit-width for each layer, balancing resource utilization and performance41.

Representative Post-Training Quantization Techniques (PTQ)

PTQ converts pretrained models to low-precision representations through calibration with minimal data, optimizing memory footprint and inference latency. For Transformer architectures, GPTQ18 pioneered layer-wise 3–4-bit quantization through a Hessian-based greedy algorithm that minimizes output reconstruction error. Its optimized implementation achieves full quantization of OPT-175B42 in 4.2 GPU hours with a minimal PPL performance loss (1–3%) after 4-bit quantization, enabling single-A800 deployment. Limitations include GPU dependency during quantization and framework-specific format constraints. AWQ16 offers an adaptive quantization approach that optimizes both weights and activations. By identifying critical weights through activation statistics, AWQ dynamically adjusts quantization granularity. While achieving superior accuracy over GPTQ at equivalent bit-widths, AWQ requires calibration datasets and incurs higher computational overhead. For CPU deployment, GGML introduced SIMD-accelerated low-bit arithmetic via AVX/NEON instructions, later superseded by GGUF’s unified format supporting multi-hardware execution (CUDA/AVX) and enhanced metadata capabilities. GGUF enables extreme compression (1–8-bit) with scalable storage, successfully reducing the 671B-parameter DeepSeek-R1 model43 below 140 GB through extreme 1-bit quantization.

Low-rank decomposition for model compression

Low-rank decomposition, as a model compression technique, aims to reduce model size by approximating weight matrices with lower-rank counterparts, leveraging the “algebraic redundancy” in models. Recent advancements in this field address various aspects of model redundancy and computational efficiency. ASVD44 addresses the issue of activation distribution variance by transforming the weight matrix based on the activation distribution, thereby allowing outliers in the activation matrix to be absorbed into the transformed weight matrix and improving decomposition accuracy. This method also incorporates an iterative calibration process to optimize layer-specific decomposition, accounting for the varying sensitivity of different LLM layers. LoSparse45 introduces a novel approach that approximates a weight matrix as the sum of a low-rank matrix and a sparse matrix. This combines the benefits of both low-rank approximations and pruning, overcoming their individual limitations: low-rank methods can ignore the diversity of neurons, and pruning can remove important neurons under high compression rates. Lillama46, on the other hand, observes that while pre-trained Transformer weights are often not inherently low-rank, their activations exhibit low-rank characteristics. It proposes a compression method that locally distills activations with low-rank weights, using SVD for initialization and a joint loss that combines teacher and student activations to accelerate convergence and reduce distillation loss. MoDeGPT47 takes a modular decomposition approach, categorizing Transformer layer weight matrices into three functional modules based on their nonlinearity levels and applying specific matrix decomposition algorithms (Nyström approximation, CR decomposition, and SVD) to each module to ensure bounded errors. This method reduces hidden dimensions through output reconstruction at a larger structural scale, offering a systematic framework for compression. 
Similarly, SVD-LLM-V29, building on SVD-LLM, addresses weight redundancy heterogeneity by assigning unique compression ratios to each weight matrix based on its theoretical truncation loss. It also refines the weight truncation process by replacing the traditional Cholesky decomposition with two rounds of SVD, ensuring lower and more stable truncation loss in practice, and thereby optimizing the loss in the weight truncation phase.

When compression becomes catastrophic

As compression techniques push LLMs toward their limits, a striking pattern emerges: performance remains remarkably stable through initial compression, only to collapse abruptly once a critical compression threshold is crossed.

Defining model phase transition

Model phase transition refers to the phenomenon observed during the compression and optimization of LLMs, such as pruning and quantization, where the model shifts abruptly from a phase of gradual performance degradation to a phase of rapid and catastrophic collapse. This phase transition occurs in two distinct stages: (1) in the early stages of compression, performance degradation is gradual and controlled, and the model maintains most of its task effectiveness and robustness; (2) as compression intensifies, the model reaches a critical threshold, the Phase Transition Point, beyond which its performance drops sharply, losing both expressive capacity and task adaptability.

We formally define the operational regime prior to this critical threshold as “near-lossless” compression. Functionally, this implies that the degradation in average downstream task metrics remains within an acceptable tolerance (≤5%), ensuring the model’s utility is largely preserved despite parameter reduction. A more direct statistical observable for this stability is WikiText-2 perplexity (PPL), where the allowable variation is ΔPPL ≈ 1.5 relative to the dense baseline. For instance, empirical data show that LLaMA2-7B maintains stability as its PPL shifts from ~5.5 (dense) to ~7.0 (at the PTP), and similarly, Qwen2.5-7B transitions from ~7.9 to ~9.2.
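For reference, PPL is simply the exponential of the mean per-token negative log-likelihood. The sketch below (with constant per-token losses chosen to reproduce the illustrative 5.5 to 7.0 shift mentioned above) shows how the ΔPPL tolerance is read off:

```python
import numpy as np

def perplexity(token_nlls):
    # WikiText-2-style PPL: exp of the mean per-token negative log-likelihood
    # (natural log) over the evaluation corpus.
    return float(np.exp(np.mean(token_nlls)))

# Illustrative numbers only: a dense model vs a compressed one near the PTP.
dense_nll = np.full(1000, np.log(5.5))       # corresponds to PPL = 5.5
compressed_nll = np.full(1000, np.log(7.0))  # corresponds to PPL = 7.0

delta = perplexity(compressed_nll) - perplexity(dense_nll)
print(f"dPPL = {delta:.2f}")
```

In practice the per-token losses come from running the model over the held-out corpus; the compression is deemed near-lossless while this delta stays within the stated tolerance.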

This phenomenon is commonly seen across various compression techniques. For example, structured pruning beyond 50% sparsity or unstructured pruning exceeding 70% often leads to sudden model collapse. Similarly, quantization below 3-bit precision typically results in a sharp decline in task performance.

Quantitative phase transition modeling

To characterize the model phase transition phenomenon across compression methods, we introduce an enhanced piecewise function L(s) modeling performance against compression ratio s. This formulation captures both the gradual degradation and catastrophic collapse phases through distinct mathematical regimes, with continuity enforced at the phase transition point s0:

$$L(s)=\left\{\begin{array}{ll}A\cdot {s}^{\alpha }+B, & s\le {s}_{0}\\ A\cdot {s}_{0}^{\alpha }\cdot \exp \left(\beta (s-{s}_{0})+\gamma {(s-{s}_{0})}^{2}\right)+B, & s > {s}_{0}\end{array}\right.$$
(5)

where s represents the compression ratio (sparsity or quantization precision), s0 denotes the phase transition point, A and α are power-law parameters governing gradual degradation with B as the performance baseline, and β and γ control the exponential collapse dynamics beyond s0. Because the baseline term B is shared by both branches, the formulation enforces C0 continuity at s0, with \(L({s}_{0})=A\cdot {s}_{0}^{\alpha }+B\). The quadratic term \(\gamma {(s-{s}_{0})}^{2}\) enables precise fitting of the accelerated collapse rates observed beyond s0, addressing a limitation of pure exponential decay models; the resulting form accurately fits empirical data from thirty compression methods while providing interpretable parameters for phase transition analysis.
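The piecewise model can be written down directly (a numpy sketch with illustrative parameter values, not values fitted to the paper's data; the baseline B is added to both branches so the curve is continuous at s0, as the shared-baseline continuity argument requires):

```python
import numpy as np

def L(s, A=2.0, alpha=1.5, B=5.5, s0=0.6, beta=8.0, gamma=25.0):
    # Piecewise power-law / exponential model of the phase transition.
    # The +B offset appears in both branches, so L is continuous at s0.
    s = np.asarray(s, dtype=float)
    pre = A * s ** alpha + B
    post = A * s0 ** alpha * np.exp(beta * (s - s0) + gamma * (s - s0) ** 2) + B
    return np.where(s <= s0, pre, post)

# C0 continuity at the phase transition point:
eps = 1e-9
gap = abs(float(L(0.6 - eps)) - float(L(0.6 + eps)))

# Gradual power-law degradation before s0, catastrophic blow-up after it:
print(L([0.2, 0.4, 0.6]))    # slow rise
print(L([0.65, 0.7, 0.75]))  # accelerating collapse
```

Fitting A, α, B, s0, β, and γ to a measured performance-versus-sparsity curve (for example by nonlinear least squares) then localizes the PTP as the estimated s0, which is how the turning points marked in the figures are obtained.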

PTP in structured pruning

To systematically characterize phase transitions in structured pruning, we reproduced several representative methods using LLaMA2-7B as the unified testbed for cross-method compatibility. Performance was evaluated via perplexity on WikiText-2—a standard language modeling benchmark that faithfully reflects degradation in linguistic structure mastery while ensuring alignment with established research protocols (lower PPL indicates superior performance). Figure 2 compares PPL evolution across sparsity levels, revealing critical trade-offs between compression-induced acceleration and accuracy preservation.

Fig. 2: Structured pruning phase transition.

This figure presents the perplexity (PPL) of several structured pruning methods across different sparsity ratios, including both experimental data and fitted curves. The stars indicate the turning points of the piecewise fitting curves, where the x-coordinate corresponds to the model’s phase transition point.

Our experiments demonstrate a consistent phase transition threshold at 30–45% sparsity (Fig. 2). Beyond this inflection point, further compression triggers catastrophic performance collapse, manifested as accelerated PPL degradation. Crucially, structured pruning exhibits significantly lower PTPs than unstructured approaches (detailed in Supplementary Information “Performance and Robustness Under Model Phase Transition”), with most methods tolerating <40% sparsity before collapse. This reduced resilience aligns with structured pruning’s fundamental mechanism: whereas unstructured pruning preserves critical weights through granular removal, structured methods discard entire architectural components (such as attention heads or layers), eliminating vital parameters prematurely. Consequently, performance degradation follows a shallower initial trajectory but reaches collapse thresholds at substantially lower compression intensities.

PTP in unstructured pruning

Recent advancements in unstructured pruning have yielded substantial progress over the past two years. Our systematic evaluation encompasses over a dozen prominent methods applied to the widely supported LLaMA-2-7B model, with perplexity serving as the primary metric for visualizing performance evolution during compression. Similar to structured pruning, the performance-compression curves reveal a definitive model phase transition. Crucially, unstructured pruning exhibits significantly higher PTPs distributed between 0.55–0.65 sparsity (Fig. 3), demonstrating superior compression resilience before collapse compared to structured approaches. This elevated threshold indicates that unstructured pruning can sustain higher compression ratios while maintaining functional integrity.

Fig. 3: Unstructured pruning phase transition.

This figure presents the perplexity (PPL) of several unstructured pruning methods across different sparsity ratios, including both experimental data and fitted curves. The stars indicate the turning points of the piecewise fitting curves, where the x-coordinate corresponds to the model’s phase transition point.

Notably, contemporary research frequently emphasizes performance comparisons at extreme compression rates (70% sparsity), positioning this as a primary differentiator. Our experimental evidence challenges this practice: method divergence remains minimal near the PTP (0.55–0.65), while models subjected to 70% sparsity exhibit complete phase transition collapse, rendering them practically unusable. This finding reveals fundamental limitations in the prevailing research paradigm centered on SparseGPT and Wanda derivatives, indicating that current optimization approaches share identical failure modes and require paradigm-shifting innovations to address the core collapse mechanism.

PTP in quantization

In order to systematically evaluate the impact of model quantization on inference performance, we conducted comprehensive experiments on multiple models quantized via the GGUF framework. These experiments covered progressive quantization from 1-bit to 16-bit precision, focusing on several widely adopted LLM families, including LLaMA-248, Qwen-2.549, and Gemma-350, which exhibit strong performance while covering a diverse range of model scales.

First, we used the WikiText-2 dataset to measure both the perplexity degradation and token generation speed for each model under varying quantization bitwidths and strategies. Our results provide a clear illustration of how quantization levels affect model performance (Table S5). Next, we selected the ARC51 and MMLU52 datasets to evaluate the model’s general knowledge and question-answering capabilities. These datasets allow us to observe the impact of progressive quantization on the accuracy of the model across various sizes. We specifically focused on how the model’s performance evolved during the full-scale quantization process (Fig. 4).

Fig. 4: Quantized model performance.

This figure shows the relationship between parameter size (GB) and perplexity (PPL) on WikiText-2 across various quantized large language models (Qwen2.5, LLaMA-2, Gemma-3). Each curve represents a different model family with multiple quantization levels from 2-bit to 16-bit. While performance degradation is smooth at higher precisions, all models exhibit a sharp perplexity spike at 2-bit quantization, identifying a consistent phase transition point where compression becomes catastrophic. Larger models (70B) demonstrate delayed collapse, indicating greater robustness due to scale.

Phase transition point

A consistent phase transition emerges at 3-bit quantization across all model families. Below this threshold, models exhibit catastrophic nonlinear collapse in WikiText-2 perplexity, with Qwen models showing ≤7% degradation at Q3_K_M versus 13–45% at Q2_K. This pattern is reinforced by knowledge-task performance: Qwen2.5-14B suffers roughly 3× greater accuracy loss on the MMLU/ARC benchmarks at 2-bit quantization. Identical transitions occur in the LLaMA-2 and Gemma families, confirming 3-bit as the universal stability boundary.

Model-scaling effects

Larger models demonstrate significantly higher phase transition resilience. At 2-bit quantization, 70B-class models preserve 94% baseline PPL and 90% MMLU accuracy (Qwen2.5-72B), while sub-10B models suffer at least 30% PPL degradation and 25% MMLU accuracy loss. The delayed collapse in massive models indicates size-dependent redundancy buffers against information loss.

Compression efficiency

Within the stable phase (3-bit and above), quantization achieves 4–5× model compression while preserving 90% baseline performance across all tasks. Below 3-bit, though compression ratios reach 6–8×, catastrophic collapse in both language modeling (PPL) and knowledge tasks (MMLU/ARC) renders models operationally unusable.

PTP in low-rank decomposition

In the domain of LLMs, low-rank decomposition methods inherently offer limited compression ratios compared to pruning or quantization, as weight matrices in contemporary LLMs often exhibit near-full-rank characteristics. To systematically characterize the phase transition behavior in this algebraic dimension, we evaluated five representative low-rank decomposition methods44,53,54,55,56 on the LLaMA2-7B model. We applied the same piecewise power-law-exponential fitting methodology to pinpoint their critical thresholds.
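The threshold-finding step can be approximated with a simple breakpoint search: fit a power law (linear in log–log coordinates) below each candidate threshold and an exponential (log-linear) above it, and keep the candidate with the lowest total residual. The sketch below runs on synthetic data with a planted breakpoint; it illustrates the idea only and is not the fitter used in our experiments:

```python
import math

def fit_ptp(xs, ys, grid):
    """Grid-search the breakpoint x_c of a piecewise power-law-exponential
    curve: least-squares fit log(y) linearly in log(1+x) below x_c and
    linearly in x above it; return the x_c with the smallest total error."""
    def sse_linear(us, vs):
        n = len(us)
        mu, mv = sum(us) / n, sum(vs) / n
        den = sum((u - mu) ** 2 for u in us) or 1e-12
        slope = sum((u - mu) * (v - mv) for u, v in zip(us, vs)) / den
        icept = mv - slope * mu
        return sum((icept + slope * u - v) ** 2 for u, v in zip(us, vs))

    best_xc, best_err = None, float("inf")
    for xc in grid:
        lo = [(math.log1p(x), math.log(y)) for x, y in zip(xs, ys) if x <= xc]
        hi = [(x, math.log(y)) for x, y in zip(xs, ys) if x > xc]
        if len(lo) < 2 or len(hi) < 2:
            continue
        err = sse_linear(*zip(*lo)) + sse_linear(*zip(*hi))
        if err < best_err:
            best_xc, best_err = xc, err
    return best_xc

# Synthetic PPL curve: power law up to a planted breakpoint at 0.55
# sparsity, then a sharp exponential rise (illustrative numbers only).
xs = [i / 20 for i in range(1, 16)]
ys = [5.0 * (1 + x) ** 1.5 if x <= 0.55
      else 8.0 * math.exp(12 * (x - 0.55)) for x in xs]
print(fit_ptp(xs, ys, grid=[i / 20 for i in range(2, 15)]))  # → 0.55
```

In practice the fit is performed on measured perplexities at each compression ratio, and the recovered breakpoint is the PTP marked by the stars in Fig. 5.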

As illustrated in Fig. 5, the performance trajectories reveal a bifurcation into two distinct phase transition regimes, differentiated by their decomposition objectives. Mode I (Weight-Dominant): Approaches prioritizing static weight reconstruction, exemplified by SFSD and ASVD, encounter premature capability collapse, with PTPs confined to the low range of 16.3%–18.7%. This empirically validates that the intrinsic algebraic redundancy of static weight matrices is critically low, limiting the effectiveness of direct spectral truncation. Mode II (Activation-Centric): Conversely, strategies that leverage the low-rank geometry of the activation space (FLAT-LLM, SoLA) or incorporate truncation-aware compensation (SVD-LLM) demonstrate significantly enhanced robustness. These methods extend the stability frontier to 28.0%–40.0% sparsity, with FLAT-LLM achieving the upper bound. This divergence underscores that while weight matrices approximate full rank, the feature manifold remains highly compressible. Below these thresholds, perplexity degradation is manageable; however, crossing them triggers an immediate and sharp exponential rise in PPL.

Fig. 5: Low-rank decomposition phase transition.

This figure presents the perplexity (PPL) of several low-rank decomposition methods (ASVD, SVD-LLM, SFSD, SoLA, FLAT-LLM) across different sparsity ratios. The scatter points represent experimental data, and the curves show the fitted piecewise power-law-exponential models. The stars indicate the phase transition points (PTPs), marking the critical sparsity threshold beyond which performance degrades sharply.

Phase transitions in combined model compression

Combined model compression refers to the application of multiple compression strategies to aggressively reduce the size of a model, achieving higher compression rates. Mainstream model compression techniques are broadly classified into three categories. The first removes unimportant parameters or neurons, as represented by structured and unstructured pruning. The second reduces the bit-width or precision of existing parameters, with model quantization being the primary approach. The third employs matrix decomposition to reduce the number of parameters, exemplified by low-rank factorization. These three techniques target different forms of model redundancy (structural, numerical, and algebraic) that often coexist in deep neural networks. In other words, a model can simultaneously be sparse, represented by fewer bits, and approximated in a lower-rank subspace. Thus, for large models, combined compression can be viewed as imposing “information bottlenecks” at multiple levels, forcing the model to retain only the most crucial information. This approach can theoretically achieve higher overall compression before reaching the PTP, where the model’s performance begins to degrade rapidly.

We combined several well-performing compression methods from different categories and analyzed their effects on the LLaMA2-7B model. Figure 6 shows the phase transition curve under the synergistic application of Wanda pruning27 and GGUF quantization. The left plot displays the PPL surface for the combined compression, while the right plot shows the contour plot of the same surface. From the left plot, it is evident that pruning has the more significant impact on model performance. Additionally, combining pruning and quantization does not substantially shift the phase transition point of either individual method (the critical thresholds remain around 55% sparsity for pruning and 2-bit precision for quantization). The orange star-shaped curve on the right plot represents the model’s phase transition curve, while the red line represents the “cost-effective” curve, showing the lowest PPL for a given compression/memory budget. The intersection of these two curves marks the model’s compression limit, considering the loss of accuracy. For the combined Wanda and GGUF approach, this limit is a retention rate of ~11%, meaning the model can be compressed to roughly one-tenth of its original size without significant performance degradation. Other combinations, such as SparseGPT coupled with GPTQ18,29, achieved an extreme retention rate of 12% (60% pruning sparsity coupled with INT4 quantization, PPL = 8.4), while ADMM integrated with GGUF reached a retention rate of 9% (60% pruning sparsity coupled with 3-bit quantization, PPL = 7.06)31.

Fig. 6: Combined pruning and quantization.

a 3D surface plot of perplexity (PPL) for LLaMA2-7b under combined GGUF quantization and Wanda pruning, illustrating how PPL varies with different compression settings. b 2D contour projection of the same surface, with the red line marking the most cost-effective compression path (minimal PPL at equivalent compression ratios) and the orange curve showing the phase transition line (PTL), beyond which model performance rapidly deteriorates.

Beyond the remarkable performance of pruning-quantization hybrids, we further explored the interaction between algebraic and numerical redundancies. As shown in Fig. 7, the experimental results reveal a clear stability hierarchy, indicating that compression methods with deeper PTPs (higher robustness) naturally dominate the effective compression space. Specifically, quantization acts as the primary driver of compression due to its superior robustness. Our analysis suggests a hierarchical intervention logic: Quantization is prioritized initially. Only when the quantization compression rate reaches approximately 60% (6-bit) should unstructured pruning methods be introduced to further reduce model size. Furthermore, low-rank decomposition and aggressive pruning ratios should only be considered when quantization approaches its critical PTP limit (76% compression at 3-bit). This sequential activation, exhausting the safe zone of the most robust method before engaging the next, maximizes the compression ratio while maintaining functional integrity, providing the empirical basis for the systematic framework introduced in the next section. These combined compression ceilings align closely with the phase transition thresholds of their constituent methods. For integration with pruning, LoSparse successfully incorporates CoFi’s structured pruning framework, mitigating limitations inherent to standalone approaches57.

Fig. 7: Combined low-rank decomposition and quantization.

a 3D surface plot of perplexity (PPL) for LLaMA2-7B under joint ASVD decomposition and GGUF quantization. b 2D contour projection of the same surface. The red line indicates the optimal compression trajectory, while the orange curve marks the Phase Transition Line (PTL). The distinct orthogonal cliffs along both axes illustrate the independence of their respective Phase Transition Points (PTPs).

In conclusion, by observing the phase transition points of individual compression techniques, we can quickly deduce the theoretical limit of combined compression before the model undergoes catastrophic degradation. This insight allows for optimizing deployment strategies and minimizing model size while maintaining sufficient performance. For instance, a 16 GB memory budget that previously accommodated only the original LLaMA2-7B can now host an extremely compressed version of LLaMA2-70B.

Criticality-aware compression framework

We propose a criticality-aware compression framework to address the limitations of ad-hoc compression combinations and provide rigorous guidelines for deployment. This framework fundamentally reframes model compression from an empirical trial-and-error process into a structured trajectory planning problem within a multi-dimensional phase space. By characterizing the critical stability boundaries of the model, we employ a phase avoidance strategy to identify the minimum energy path for optimal compression.

Theoretical orthogonality

The feasibility of our framework is grounded in the orthogonality of compression mechanisms (detailed in Supplementary Information “Orthogonality of Compression Mechanisms”). Since pruning, quantization, and low-rank decomposition target disjoint redundancy subspaces (Spatial, Numerical, and Algebraic), their induced errors are statistically additive rather than multiplicative. This orthogonality implies that applying one method does not significantly shift the Phase Transition Points (PTPs) of others. Consequently, the phase space of a model can be defined as a hyper-rectangle bounded by the individual PTPs of each method. Within this bounded region, the interaction effects are minimal, allowing for predictable performance behavior.
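Under this orthogonality assumption, membership in the safe phase space is simply a conjunction of per-method threshold tests. A minimal sketch, using PTP values in line with those reported above (these are illustrative, model-dependent defaults, not universal constants; the 3-bit boundary itself is treated as admissible, as in our final configuration):

```python
# Hypothetical PTP values consistent with the experiments reported above.
PTP = {"sparsity": 0.55, "bits": 3, "rank_reduction": 0.30}

def in_safe_region(sparsity, bits, rank_reduction, ptp=PTP):
    """Membership test for the safe hyper-rectangle: stay below the
    pruning and decomposition thresholds, and at or above the 3-bit
    quantization boundary identified as the stability limit."""
    return (sparsity < ptp["sparsity"]
            and bits >= ptp["bits"]
            and rank_reduction < ptp["rank_reduction"])

print(in_safe_region(0.35, 3, 0.05))  # → True  (the LLaMA2-7B-PTP point)
print(in_safe_region(0.35, 2, 0.05))  # → False (2-bit crosses the quant PTP)
```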

Phase avoidance strategy via trajectory planning

We formalize the phase avoidance strategy not merely as a heuristic, but as a constrained trajectory optimization problem on the model’s potential energy surface. Here, we define the model’s perplexity \({\mathcal{L}}({\mathcal{C}})\) as the potential energy of the system state \({\mathcal{C}}\).

The geometry of degradation

The multi-dimensional phase space is topologically partitioned into two distinct regions by the critical thresholds of each method. The first is the region of graceful degradation (\({{\mathcal{S}}}_{safe}\)), defined as the subspace where the loss function \({\mathcal{L}}\) exhibits convex or linear behavior with respect to the compression ratio. Mathematically, this corresponds to the regime where the perturbation δ introduced by compression satisfies \({\mathcal{L}}(\theta +\delta )\approx {\mathcal{L}}(\theta )+\nabla {{\mathcal{L}}}^{T}\delta\), and higher-order derivatives are negligible. In this region, capability loss is predictable and recoverable. The boundary of this region is the event horizon (\(\partial {\mathcal{S}}\)), formed by the union of individual phase transition points (PTPs): \(\partial {\mathcal{S}}=\{{\mathcal{C}}\,|\,s=PT{P}_{prune}\,\vee\,b=PT{P}_{quant}\,\vee\,r=PT{P}_{rank}\}\). Crossing this boundary drives the system into a chaotic regime where the Hessian spectrum of the loss function undergoes catastrophic changes, leading to exponential performance collapse.

Minimum energy path optimization

The goal of the phase avoidance strategy is to navigate from the dense state to a target compressed state along a minimum energy path. Unlike standard optimization, which seeks a local minimum, this process seeks a trajectory \({\mathcal{T}}\) that maximizes compression while keeping the system’s potential energy (PPL) minimal and strictly within \({{\mathcal{S}}}_{safe}\).

Let \({\mathcal{C}}=(s,b,r)\) be the configuration state vector representing sparsity, bit-width, and rank reduction. The compression problem is formulated as finding the optimal configuration \({{\mathcal{C}}}^{* }\) that minimizes model size subject to stability constraints:

$$\begin{array}{l}\mathop{\min }\limits_{{\mathcal{C}}}\,Size({\mathcal{C}})\\ \,{\rm{s.t.}}\,\,{\mathcal{L}}({\mathcal{C}})-{{\mathcal{L}}}_{base}\le \epsilon \,\,(\text{Near-Lossless Constraint})\\ {\mathcal{C}}\in {{\mathcal{S}}}_{safe}\,\,\iff \,\,\{s < PT{P}_{prune},\,b > PT{P}_{quant},\,r < PT{P}_{rank}\}\end{array}$$
(6)

By treating the PTPs as hard constraints (the event horizon), the solver is forced to exploit the redundancy dimensions with the shallowest energy gradients (highest robustness) first, naturally deriving the sequential activation strategy described in Fig. 8.
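Because the feasible set in Eq. (6) is a hyper-rectangle, even a coarse grid search solves it. The sketch below assumes a toy linear size model and a hypothetical additive perplexity surface (`toy_ppl`, `eps`, and the PTP values are illustrative stand-ins, not measured quantities; the 3-bit boundary is treated as admissible):

```python
from itertools import product

def model_size_gb(s, b, r, dense_gb=13.5):
    """Crude size model (an assumption, not the paper's): bit-width scales
    storage linearly; pruning and rank reduction remove parameters."""
    return dense_gb * (b / 16.0) * (1.0 - s) * (1.0 - r)

def plan(ppl, ppl_base=5.47, eps=1.6, ptp=(0.55, 3, 0.30)):
    """Coarse grid search for Eq. (6): minimize size over configurations
    that satisfy the hard PTP constraints and the near-lossless budget."""
    s_ptp, b_ptp, r_ptp = ptp
    best = None
    for s, b, r in product([0.0, 0.2, 0.35, 0.5],
                           [16, 8, 4, 3],
                           [0.0, 0.05, 0.2]):
        if not (s < s_ptp and b >= b_ptp and r < r_ptp):
            continue  # outside the safe hyper-rectangle
        if ppl(s, b, r) - ppl_base > eps:
            continue  # near-lossless constraint violated
        size = model_size_gb(s, b, r)
        if best is None or size < best[0]:
            best = (size, (s, b, r))
    return best

# Hypothetical additive perplexity surface (illustrative numbers only).
toy_ppl = lambda s, b, r: 5.47 + 2.0 * s + 0.05 * (16 - b) + 3.0 * r
size, config = plan(toy_ppl)
print(config)  # → (0.35, 3, 0.05)
```

Even this toy surface pushes the solver toward the robust quantization axis first, mirroring the sequential activation strategy described in the text.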

Fig. 8: Phase avoidance strategy in multi-dimensional compression space.

This figure illustrates the minimum energy trajectory for compressing LLaMA2-7B, guided by Perplexity (PPL) as the potential energy function. The 3D space is defined by three orthogonal compression dimensions: Quantization (GGUF), Unstructured Pruning (ADMM), and Low-Rank Decomposition (ASVD). The black solid line represents the optimal compression path, which navigates through the safe zone bounded by the phase transition points (PTPs) of each method. The projections on the three planes visualize the pairwise trade-offs. The star marker denotes the final compressed state of our LLaMA2-7B-PTP model (Size: 1.89 GB), achieving a compound compression configuration of 76% Quantization (3-bit), 35% Pruning, and 5% Decomposition, strictly avoiding the collapse regions (red zones).

Heuristic guidelines for compression

Based on our extensive empirical analysis of PTPs and the trajectory shown in Fig. 8, we synthesize the heuristic guidelines for optimal compression planning.

Priority by robustness

Our analysis reveals a clear hierarchy in redundancy robustness: Numerical > Structural > Algebraic. Numerical redundancy, exploited by quantization, exhibits the deepest PTP, remaining robust down to 3-bit precision. Structural redundancy (pruning) follows, tolerating up to ~50% sparsity. Algebraic redundancy (decomposition) is the least robust, with PTPs often occurring at ~20–30% removal. Consequently, quantization should serve as the primary driver of compression.

Sequential activation

The optimal trajectory suggests a sequential activation strategy aligned with the robustness hierarchy. Quantization is prioritized initially to reduce model size rapidly. Unstructured pruning is introduced only when quantization reaches saturation (a compression rate of ~60%). Low-rank decomposition acts as the final lever, activated only as quantization approaches its critical PTP limit (76% compression at Q3_K_M). This staged approach ensures that the most stable redundancy sources are exhausted before engaging more sensitive ones.

Engineering execution order

Crucially, we distinguish between planning priority and execution sequence. While quantization takes precedence in budget allocation, it must occur last in the actual deployment pipeline (i.e., Decomposition → Pruning → Quantization). Quantization is an irreversible operation that introduces noise and discretizes the optimization landscape; therefore, structural changes (pruning and decomposition) must be performed on high-precision weights first to ensure the accuracy of importance calculations, with quantization applied effectively as a final encapsulation step.
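The distinction between planning priority and execution order can be sketched as a pipeline in which each stage is a stand-in stub for a real method (ASVD, ADMM, GGUF); only the ordering matters here, with quantization applied last:

```python
def decompose(w, rank_reduction):
    # stub for ASVD-style decomposition: drop a fraction of components
    keep = int(len(w) * (1.0 - rank_reduction))
    return w[:keep]

def prune(w, sparsity):
    # stub for ADMM-style unstructured pruning: zero the smallest weights
    k = int(len(w) * sparsity)
    cutoff = sorted(abs(x) for x in w)[k] if k else 0.0
    return [x if abs(x) >= cutoff else 0.0 for x in w]

def quantize(w, bits):
    # stub for GGUF-style quantization: symmetric uniform rounding
    levels = 2 ** (bits - 1) - 1
    scale = max((abs(x) for x in w), default=1.0) or 1.0
    return [round(x / scale * levels) / levels * scale for x in w]

def compress(weights, plan):
    """Execution order: Decomposition -> Pruning -> Quantization. The
    structural steps operate on high-precision weights; quantization is
    the final, irreversible encapsulation step."""
    w = decompose(weights, plan["r"])
    w = prune(w, plan["s"])
    return quantize(w, plan["b"])

out = compress([0.9, -0.4, 0.1, 0.05] * 5,
               {"r": 0.05, "s": 0.35, "b": 3})
```

Reversing this order would let quantization noise corrupt the importance scores that pruning and decomposition depend on, which is why quantization must come last despite being planned first.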

Validation and perspectives

To validate the Phase Avoidance Strategy, we conducted a comparative analysis focusing on the “Compress Big” versus “Native Small” hypothesis. We compared the LLaMA-2-7B model compressed using our PTP-guided framework against a natively trained small model (LLaMA-3.2-1B) and a compressed newer-generation model of similar size (LLaMA-3.1-8B).

Visualization of the optimal trajectory

Figure 8 visualizes the actual compression path taken for the LLaMA-2-7B experiment. The trajectory strictly adheres to the safe zones defined by the PTPs of Quantization (GGUF), Pruning (ADMM), and Decomposition (ASVD). The final operating point, marked by the star, corresponds to the LLaMA2-7B-PTP model in Table 3. This point represents a sophisticated equilibrium:

Table 3 Comprehensive comparison of criticality-aware compression vs. OOPTP baselines and native small models

Quantization: 3-bit (Q3_K_M), contributing ~76% compression.

Pruning: 35% unstructured sparsity, further reducing redundancy without breaking structural integrity.

Decomposition: 5% rank reduction, shaving off the final algebraic redundancy.

This compound configuration yields an 85% total compression rate (1.89 GB final size) while maintaining a PPL of 6.92, demonstrating the efficacy of avoiding single-dimension collapse.
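Under the orthogonality assumption, the three retention ratios multiply, which is how the reported total compression rate arises from the stated per-method rates:

```python
# Retained fraction under the compound configuration: the three methods
# act on orthogonal redundancy dimensions, so their retentions multiply.
quant_retain = 1 - 0.76   # 3-bit (Q3_K_M): ~76% compression
prune_retain = 1 - 0.35   # 35% unstructured sparsity
rank_retain  = 1 - 0.05   # 5% rank reduction

retained = quant_retain * prune_retain * rank_retain  # ≈ 0.148
print(f"total compression ≈ {1 - retained:.0%}")      # → total compression ≈ 85%
```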

Evaluation benchmarks

We employed a diverse set of benchmarks to rigorously assess model capabilities across language modeling, reasoning, and generation quality. Perplexity (PPL) on WikiText-2 served as the primary indicator of language modeling stability. For reasoning and knowledge, we utilized ARC (Challenge and Easy)51, PIQA58, Winogrande59, HellaSwag60, and BoolQ61 to evaluate common-sense reasoning and factual accuracy. Generation Quality was assessed using ROUGE-1/2/L62 scores on the CNN/DailyMail63 dataset to measure information overlap and fluency, alongside BERTScore64 to evaluate semantic coherence. This comprehensive suite ensures that our compression strategy preserves not just statistical patterns but also the emergent cognitive abilities of LLMs.
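For reference, ROUGE-1 F1, the simplest of the ROUGE variants listed above, is just unigram-overlap precision and recall combined into an F-score; a minimal sketch (our evaluations used standard tooling, not this toy implementation):

```python
from collections import Counter

def rouge1_f(candidate, reference):
    """ROUGE-1 F1: harmonic mean of unigram-overlap precision and recall
    between a generated summary and a reference."""
    c = Counter(candidate.lower().split())
    r = Counter(reference.lower().split())
    overlap = sum((c & r).values())  # clipped unigram matches
    if not overlap:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f("the model was compressed",
                 "the model was heavily compressed")
```

ROUGE-2 and ROUGE-L extend the same idea to bigrams and longest common subsequences, while BERTScore replaces exact token matching with embedding similarity.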

PTP-guided compression performance

Table 3 presents the comprehensive results, from which we highlight three key observations regarding the efficacy of our framework.

Superiority of phase avoidance (Group 1)

The LLaMA2-7B-chat-PTP model (Combined Compression) achieves a compact size of 1.89 GB (85% compression ratio) by strictly adhering to the safe zones of all three methods. Despite being smaller than the single-method aggressive baselines, such as Q2_K at 2.36 GB or ADMM at 4.39 GB, it maintains a WikiText-2 PPL of 6.91. This performance significantly outstrips the collapsed “Out Of Phase Transition Point” (OOPTP) baselines, where ADMM degrades to a PPL of 9.54 and ASVD to 8.59. These results confirm that avoiding the phase transition in multiple dimensions yields superior retention of model capabilities compared to pushing a single dimension to its breaking point.

The “Compress Big” advantage (Group 2)

A critical finding emerges from the comparison between the compressed LLaMA2-7B-chat-PTP (1.89 GB) and the natively trained LLaMA-3.2-1B-Instruct (2.3 GB). Despite being 18% smaller in storage, the compressed 7B model significantly outperforms the native 1B model across almost all benchmarks. For instance, it achieves an ARC-C score of 55.0 compared to 45.0 for the native model, and a BERTScore of 87.06 versus 73.21. This challenges the prevailing industry trend of training small models from scratch, suggesting that compressing larger models allows for the retention of complex “world model” features that small models simply never acquire during pre-training.

Generational robustness (Group 3)

Applying our framework to the newer LLaMA-3.1-8B, we successfully compressed it to 2.4 GB, matching the size of the 1B model. This compressed model achieves state-of-the-art performance for its size class, with an ARC-C score of 72.0 and strong MMLU-implied capabilities. This result further validates the universality of the Phase Avoidance Strategy, demonstrating its applicability and robustness across different model generations.

Perspectives: rethinking efficient AI

Based on the Criticality-Aware Framework and our experimental results, we propose four perspectives to guide future efficient AI development:

(1) The Illusion of Scale: Existing LLM architectures exhibit an illusion where parameter count is conflated with capability. Our results show that at least 90% of the parameters in current dense models (like LLaMA-2 and LLaMA-3) are redundant for inference, serving primarily as a scratchpad for optimization during training.

(2) Superiority of Compression over Ab Initio Training: We advocate for a paradigm shift from training small models from scratch to compressing large pre-trained models. Large-scale architectures possess the capacity to capture complex, high-dimensional feature representations during pre-training that smaller architectures inherently fail to acquire. Our framework demonstrates that PTP-guided compression preserves these sophisticated representations within a reduced memory footprint, yielding reasoning capabilities significantly superior to those of native models of comparable size.

(3) Maximizing Information Density for Edge Deployment: In resource-constrained environments (such as edge devices), the industry intuition is often to select a native small model. We argue this is suboptimal. The information density of a compressed large model far exceeds that of a native small model. The golden rule for deployment should be: always train the largest possible model, then compress it to the target budget using Phase Avoidance.

(4) The Event Horizon of Capability: The catastrophic failure of OOPTP models (detailed in Table 3) illustrates that the Phase Transition Point is not merely a performance dip, but an event horizon of model capability. Beyond this critical threshold, the model does not just get weaker; it undergoes a qualitative collapse, losing the emergent abilities that define LLMs. Respecting this horizon is the fundamental constraint of efficient AI.

Conclusions and outlook

This paper systematically revisits the phenomenon of model phase transition, where LLMs transition from controlled performance degradation to catastrophic collapse under progressive compression. Our Perspective integrates theoretical insights, experimental findings, and future research directions. Below, we summarize key discoveries and outline promising research trajectories.

The fundamental limits of compression

Model phase transition theory reveals that compression boundaries are fundamentally governed by critical phase transition points. Our comprehensive analysis establishes distinct PTP distributions across compression paradigms: pruning (65% sparsity for unstructured, 45% for structured), quantization (3-bit precision, equivalent to 77% compression), and low-rank decomposition (30% sparsity). These thresholds originate from three orthogonal redundancy mechanisms—structural, numerical, and algebraic—that collectively constitute the theoretical foundation of phase transitions. Crucially, the orthogonal nature of these redundancies ensures PTP stability under combined compression strategies.

Theoretical framework and deployment implications

Our piecewise power-law-exponential formulation quantitatively models performance-compression curves across methodologies. Beyond identifying Pareto-optimal compression ratios at PTPs, this formalism enables performance prediction under arbitrary memory constraints. By converting the compression problem into a trajectory planning task within the phase space, our Criticality-Aware Framework provides a methodological guarantee for the “Deployment Golden Rule” proposed in “Validation and Perspectives,” enabling near-lossless compression down to 10% of the original model size.

Compute-optimal LLM deployment

While recent training literature advocates Compute-Optimal LLMs65, we argue that deployment efficiency demands analogous optimization. Current state-of-the-art quantization achieves 80% compression with minimal accuracy loss but remains suboptimal. Future work should pursue hybrid compression, synergistically combining pruning’s structural elimination, quantization’s precision reduction, and decomposition’s rank truncation, to transcend existing PTP limits. Additionally, inference-phase optimizations like KV cache compression warrant equal consideration alongside weight-level compression.

Sustainable AI development

As the industry confronts the impending “Data Wall” and the diminishing marginal returns of purely scaling compute, the trajectory of AI development is shifting from the brute-force “Age of Scaling” to a nuance-driven “Age of Research.” In this new paradigm, the challenge is no longer how much compute can be deployed, but how intelligently it can be utilized. Echoing the “Illusion of Scale,” our MPT framework underscores that parameter redundancy is a strategic resource to be mined, not a burden to be carried. Future advancements must prioritize compression efficiency and architectural elegance, maximizing the computational value per parameter to achieve sustainable, high-density intelligence.