Introduction

Large language models (LLMs) have rapidly transformed natural language processing (NLP) through the adoption of the transformer architecture, first introduced in the seminal work “Attention is All You Need”1. Unlike earlier architectures such as RNNs and LSTMs, transformers utilize multi-head self-attention to effectively model long-range dependencies while enabling highly parallelized training. This innovation facilitated the development of a new generation of large-scale models, exemplified by BERT2 and GPT3, which achieved unprecedented performance across diverse NLP tasks. As transformer-based models scaled in size, they began to exhibit emergent abilities, signifying a paradigm shift from task-specific systems toward general-purpose artificial intelligence.

The revolutionary parallelization capabilities inherent in the Transformer architecture have propelled LLMs into an era of unprecedented scale, yet concurrently ushered in unsustainable resource demands. Initial models like BERT were trainable within days on commodity GPUs; in stark contrast, training contemporary trillion-parameter models necessitates exascale computation. For instance, training PaLM-540B consumed over 8.4 million TPU hours4, and the carbon footprint of GPT-3 reached 552 metric tons of CO2 equivalent5. Empirically underpinned by “Scaling Laws”6, this exponential growth in parameters, data, and computational resources has yielded significant performance gains, but at the substantial cost of soaring computational, memory, and environmental overheads. Crucially, the challenges extend beyond training: fewer than 4% of NLP research studies deploy full-scale LLMs in real-world experiments7, underscoring a growing divide between frontier model development and practical accessibility. Thus, enabling plug-and-play deployment of high-performance models has become a vital objective for practical applications8.

To mitigate the soaring computational costs and facilitate real-world deployment, a broad range of model compression techniques has been developed. These methods aim to reduce the memory footprint, inference latency, and energy consumption of LLMs without incurring prohibitive accuracy degradation. Among them, structured pruning eliminates entire components such as attention heads, feedforward blocks, or layers based on their relative importance, yielding a hardware-friendly sparsity pattern. In contrast, unstructured pruning operates at a finer granularity, removing individual weights or connections, often resulting in higher compression ratios but less predictable hardware acceleration. Quantization techniques reduce the precision of model parameters and activations, replacing standard 16-bit or 32-bit floating-point representations with low-bit formats, thereby achieving dramatic reductions in storage and computational requirements. A parallel line of work explores low-rank decomposition9, which approximates weight matrices using the product of smaller-rank tensors, preserving essential information while reducing parameter count and matrix multiplication complexity. While these compression methods significantly alleviate resource demands, they frequently compromise model reliability. Compressed models can suffer from unstable performance, reduced generalization capacity, and, critically, abrupt capability loss when compression exceeds specific thresholds—a phenomenon termed the “Phase Transition Point” (PTP), which underscores the non-linear risks inherent in aggressive compression. Capturing and understanding this dynamic behavior is therefore essential for advancing compression strategies beyond trial-and-error heuristics.

This Perspective introduces the concept of “Model Phase Transition” to fundamentally characterize performance degradation and near-lossless compression limits in LLMs. The “Overview of Compression Techniques” section characterizes model redundancy mechanisms and establishes their theoretical orthogonality. The “When Compression Becomes Catastrophic” section quantitatively models performance trajectories to pinpoint critical Phase Transition Points across individual and combined methods. The “Criticality-Aware Compression Framework” section proposes a transformation of compression into a multi-dimensional trajectory planning problem guided by Phase Avoidance. The “Validation and Perspectives” section validates this strategy through comparative experiments, demonstrating that compressed large models outperform native small ones, and offers perspectives on efficient AI. Finally, the “Conclusions and Outlook” section summarizes fundamental limits and outlines future research trajectories. Detailed analyses, including mathematical proofs of orthogonality, robustness assessments, low-rank decomposition transitions, combined compression strategies, and benchmarks of ~40 methods, are provided in the Supplementary Information. Related papers and supporting materials will be regularly updated at https://github.com/whucs21Mzy/Model-Phase-Transitions.

Overview of compression techniques

The drive toward efficient LLMs has spawned a spectrum of compression techniques, each navigating distinct trade-offs between computational frugality and functional preservation. As we later reveal, all methods converge toward a universal phase transition boundary where aggressive compression triggers catastrophic collapse. Here, we dissect four dominant paradigms: structured pruning (targeting hardware-friendly substructures), unstructured pruning (maximizing fine-grained sparsity), quantization (reducing numerical precision), and low-rank decomposition (factorizing weight matrices).

Redundancy as the foundation of phase transitions

The existence of model phase transitions fundamentally stems from three complementary forms of redundancy inherent in large-scale neural architectures. These redundancy mechanisms collectively create buffers against compression damage but exhibit critical exhaustion thresholds that trigger phase transitions. Furthermore, we provide a detailed mathematical proof regarding the orthogonality of these redundancy types in the Supplementary Information “Orthogonality of Compression Mechanisms”, justifying their independent analysis.

Structural redundancy

Structural redundancy arises from architectural properties enabling functional preservation under component removal. The Lottery Ticket Hypothesis reveals that dense networks contain efficient subnetworks capable of maintaining full functionality10,11, allowing gradual pruning without immediate collapse. Modern Transformers amplify this through residual connections, where the skip operation \(x^{(\ell+1)}=x^{(\ell)}+f(x^{(\ell)},\theta^{(\ell)})\) mathematically guarantees output stability (\(x^{(\ell)}\approx x^{(\ell-1)}+\epsilon\)). This permits substantial layer removal with minimal functional degradation12. Crucially, dynamic compensation mechanisms allow downstream components to redistribute functionality when upstream elements are compromised, extending the buffer zone before phase transition.
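The stabilizing effect of the skip path can be illustrated with a toy residual stack (a minimal numpy sketch, not an actual Transformer; the layer count, width, and weight scale below are arbitrary illustrative choices): because each block is a small correction on top of the identity path, removing one block perturbs the final output only slightly.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_layers = 64, 8

# Toy residual stack: x_{l+1} = x_l + f_l(x_l), with small-norm blocks f_l
# mimicking the near-identity behaviour of trained residual layers.
weights = [0.1 * rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_layers)]

def forward(x, skip_layer=None):
    for l, W in enumerate(weights):
        if l == skip_layer:
            continue  # structured pruning: drop this residual block entirely
        x = x + np.tanh(x @ W)
    return x

x0 = rng.standard_normal(d)
full = forward(x0)
pruned = forward(x0, skip_layer=4)

# Relative output change stays small: the identity path carries most of the signal.
rel_err = np.linalg.norm(full - pruned) / np.linalg.norm(full)
print(f"relative output change after dropping one layer: {rel_err:.3f}")
```

Without the skip connection, removing a layer would replace its entire transformation rather than a small residual correction, and the output would change drastically.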

Numerical redundancy

Numerical redundancy arises from the extreme imbalance in weight or activation distributions. The vast majority of values concentrate within a narrow range, while a minority of outliers exert disproportionate influence on outputs. This heavy-tailed distribution enables compression of 99% of values with negligible impact. Critically, quantization error propagates non-uniformly:

Consider the full-precision operation and its quantized counterpart:

$$y=Wx,\qquad \widehat{y}=Q(W)\,x.$$
(1)

The resulting quantization error decomposes into two distinct components:

$${\left\Vert y-\hat{y}\right\Vert }_{2}^{2}=\underbrace{\sum _{(i,j)\in \text{normal}}{\left[\text{Err}({w}_{ij})\right]}^{2}}_{\text{negligible}}+\underbrace{\sum _{(i,j)\in \text{outliers}}{\left[\text{Err}({w}_{ij})\right]}^{2}}_{\text{dominant}}$$
(2)

This dominance is intrinsic to the heavy-tailed distribution of LLM parameters13. Standard uniform quantization faces a dilemma: accommodating the wide dynamic range of outliers forces a large quantization step size, increasing error for the dense “normal” region; conversely, narrowing the range to fit normal values clips outliers, causing massive individual errors14,15. Since these outliers often encode critical emergent features, their distortion dominates the total error norm.
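The dilemma can be made concrete with a small numpy sketch (synthetic heavy-tailed weights and a generic 4-bit uniform quantizer, not any published scheme): when the quantization range is fitted to the dense bulk, the clipped outliers contribute almost all of the squared error, mirroring the decomposition in Eq. (2).

```python
import numpy as np

rng = np.random.default_rng(0)

# Heavy-tailed weight vector: a dense bulk of small values plus ~0.5% outliers
# of roughly 100x larger magnitude.
w = 0.02 * rng.standard_normal(10_000)
outliers = rng.choice(w.size, size=50, replace=False)
w[outliers] *= 100.0

def quantize(x, bits, clip):
    # Symmetric uniform quantizer whose range is fitted to `clip`;
    # values beyond the range saturate (are clipped).
    qmax = 2 ** (bits - 1) - 1
    scale = clip / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

# Range fitted to the dense "normal" region, so outliers are clipped:
clip = np.percentile(np.abs(w), 99.0)
sq_err = (w - quantize(w, bits=4, clip=clip)) ** 2

mask = np.zeros(w.size, dtype=bool)
mask[outliers] = True
normal_err, outlier_err = sq_err[~mask].sum(), sq_err[mask].sum()
print(f"normal term: {normal_err:.4f}, outlier term: {outlier_err:.1f}")
```

Widening the range to cover the outliers instead reverses the trade-off: the outlier term shrinks, but the step size for the dense bulk grows by two orders of magnitude, which is precisely the dilemma described above.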

Therefore, state-of-the-art strategies prioritize preserving outlier precision. Observing that outliers concentrate in specific channels14,15, methods like AWQ16 perform activation-aware scaling to protect salient weights. SmoothQuant17 mathematically migrates the difficulty of quantization from activations to weights. GPTQ18 further employs second-order Hessian information to iteratively compensate for errors induced by quantizing these critical parameters. These methods collectively validate that effectively managing outlier error is key to extending the compression phase.

Algebraic redundancy

Algebraic redundancy refers to the inherent low-rank property within weight matrices, where model weights and activations, despite being high-dimensional matrices, can be approximated by lower-rank representations. A matrix \(W\in {{\mathbb{R}}}^{m\times n}\) decomposes as

$$W=U\Sigma {V}^{\top }.$$
(3)

This redundancy arises from two primary sources: (1) Linear correlations between neurons, manifested as significant coherence among columns (neurons) of the weight matrix, enabling representation via a minimal set of basis vectors, and (2) The stronger low-rank characteristic of LLM activations compared to weights19. Crucially, the singular values of LLM weight matrices exhibit rapid decay beyond the top-k values, indicating that most energy concentrates in a low-rank subspace. Smaller singular values contribute minimally to the matrix and can thus be truncated, yielding the approximation

$${W}_{k}={U}_{k}{\Sigma }_{k}{V}_{k}^{\top }.$$
(4)
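The truncation of Eq. (4) can be sketched in a few lines of numpy (the matrix below is synthetic, with an artificially planted low-rank structure rather than real LLM weights):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 256, 256, 16

# Synthetic weight matrix: a rank-16 signal plus small noise, mimicking the
# rapidly decaying singular spectra described in the text.
W = rng.standard_normal((m, r)) @ rng.standard_normal((r, n)) \
    + 0.01 * rng.standard_normal((m, n))

# numpy returns singular values in descending order, so truncation is a slice.
U, S, Vt = np.linalg.svd(W, full_matrices=False)

k = 16
W_k = (U[:, :k] * S[:k]) @ Vt[:k]  # W_k = U_k Sigma_k V_k^T

rel_err = np.linalg.norm(W - W_k) / np.linalg.norm(W)
params_full, params_lowrank = m * n, k * (m + n + 1)  # store U_k, Sigma_k, V_k
print(f"rank-{k} relative error: {rel_err:.4f}, "
      f"parameter ratio: {params_lowrank / params_full:.2f}")
```

Storing the factors \(U_k\), \(\Sigma_k\), and \(V_k\) instead of \(W\) reduces both the parameter count and the cost of the matrix-vector product, which can be computed as \(U_k(\Sigma_k(V_k^\top x))\).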

These redundancy buffers saturate nonlinearly upon reaching critical compression thresholds (PTPs). Structural compensation capacity exhausts first due to component removal, followed by numerical or approximation errors overwhelming outlier preservation and low-rank truncation. Larger models exhibit delayed PTPs due to expanded redundancy buffers, extending the safe compression zone before catastrophic collapse (Fig. 1).

Fig. 1: Model phase transitions and redundancy in model compression.

This figure highlights three main types of redundancy: structural, numerical, and algebraic redundancy. Structural redundancy is managed through pruning, numerical redundancy through quantization, and algebraic redundancy through low-rank decomposition. These redundancies act as buffers, allowing for lossless model compression until the phase transition point is reached. The phase transition point remains stable when different types of compression methods are used together, enabling lossless compression of large models to about 10% of their original size.

Pruning-induced model compression

Structured pruning

Structured pruning removes neurons, attention heads, channels, sub-layers, or entire layers according to specific rules, or zeroes out weights in fixed blocks (semi-structured pruning). Because it retains the overall network structure, it is more conducive to hardware acceleration. As noted in previous work20, structured pruning strategies can be categorized into three types based on pruning criteria and optimization objectives: size-based pruning, regularization-based pruning, and loss-based pruning.

Size-based Pruning removes less important components by measuring the importance of weights, activations, or redundancy with the goal of directly reducing the model size while maintaining performance. Methods like FLAP21 and ShortGPT22 fall under this category. Regularization-based Pruning introduces regularization terms (e.g., L1 regularization or angular distance regularization) into the objective function to constrain the weight distribution, inducing sparsity and selectively removing unimportant components. Examples include Sheared LLaMA23 and SRAD24. Loss-based Pruning quantifies the sensitivity of weights to the loss function to assess the impact of pruning on the overall model performance, prioritizing the removal of components that have minimal effects on the loss. This approach is exemplified by methods like LLM-Pruner25 and SLEB26.
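As a minimal illustration of the size-based family (a toy criterion only, far simpler than FLAP or ShortGPT), one can rank attention heads by their weight norm and remove the weakest heads wholesale, leaving a smaller but still dense tensor:

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, d_head, d_model = 12, 64, 768

# Per-head slices of an attention output projection. The synthetic heads are
# given different magnitudes so that a size-based criterion has something
# meaningful to rank.
scales = np.linspace(0.2, 1.0, n_heads)
W_o = scales[:, None, None] * rng.standard_normal((n_heads, d_head, d_model))

# Size-based importance: the L2 norm of each head's flattened weights.
importance = np.linalg.norm(W_o.reshape(n_heads, -1), axis=1)
keep = np.sort(np.argsort(importance)[-8:])  # retain the top-8 heads

# Whole heads are removed, so the pruned tensor stays dense and
# hardware-friendly (no irregular sparsity pattern).
W_o_pruned = W_o[keep]
print(W_o_pruned.shape, keep)
```

Real methods replace the norm-based score with activation-aware or redundancy-aware importance measures, but the removal pattern (whole components at once) is the same.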

These three pruning strategies offer unique advantages and collectively support the goal of enhancing efficiency and robustness in large-scale models. Table 1 summarizes some structured pruning methods.

Table 1 Summary of structured pruning methods, formulas, and categories

Unstructured pruning

Unstructured pruning is an optimization technique that achieves model sparsity by evaluating the importance of individual weights. Its flexibility and high compression rates make it a key method for optimizing LLMs. Unstructured pruning can achieve extremely high compression rates; for instance, Wanda achieves a 60% sparsity rate on LLaMA-7B with minimal performance degradation across multiple downstream tasks27, while Flash-LLM achieves a 70% sparsity rate on OPT-175B, significantly reducing storage requirements with <2% performance degradation during inference28. However, unstructured pruning often results in irregular sparse patterns in the weight matrix, necessitating specialized hardware accelerators (sparse matrix multiplication units) to efficiently handle sparse matrix computations and fully exploit the benefits of sparsity in terms of storage and computation.

Among various unstructured pruning methods, Magnitude Pruning is the most basic, directly removing weights with small magnitudes. While simple to implement, it does not account for the contextual importance of weights. SparseGPT29, on the other hand, introduces a diagonal Hessian approximation to assess the impact of weights on errors, enabling more precise pruning at the cost of high computational complexity and hardware resource requirements. Wanda27 simplifies the SparseGPT algorithm by eliminating the need for Hessian approximations and instead computing pruning metrics by multiplying weights with input activations. This simplification significantly reduces computational complexity while achieving a balance between high accuracy and efficiency. Following this approach, many subsequent methods use SparseGPT and Wanda as baselines or build upon their foundations. RIA30 introduces a post-training pruning method that re-evaluates the importance of each weight element based on all input and output connections. ADMM31 builds on SparseGPT by incorporating the Alternating Direction Method of Multipliers (ADMM) to restore model performance after pruning, using a simple iterative mask selection process for pruning. OWL32 integrates both Wanda and SparseGPT, proposing the OWL metric to allocate varying pruning rates across different layers. Similarly, BESA33 refines pruning by considering each transformer block’s pruning error and allocating sparsity in a differentiable way, overcoming the perturbations associated with traditional layer-wise approaches. DsnoT34 is also an extension of the SparseGPT and Wanda pruning strategies, introducing a training-free fine-tuning approach that iteratively refines sparse LLMs by adjusting sparse masks, minimizing the reconstruction error between sparse and dense models. Several pruning methods have been developed independently of Wanda and SparseGPT. 
For example, Flash-LLM28 introduces a “Load-as-Sparse, Compute-as-Dense” strategy, which optimizes memory bandwidth while allowing tensor cores to perform computations as if the model were dense. LoRAPrune35 incorporates LoRA (Low-Rank Adaptation) modules to evaluate the importance of weights and activations, excelling in task-specific pruning scenarios, albeit at the expense of additional computational overhead due to the extra modules. Table 2 summarizes the specific details of these methods.
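The contrast between plain magnitude pruning and activation-aware scoring can be sketched as follows (a simplified rendition of Wanda's \(|w_{ij}|\cdot \Vert x_j\Vert_2\) metric on synthetic data with planted outlier channels; the published method operates on calibration batches inside each Transformer layer):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, n_tokens = 64, 128, 512

W = 0.02 * rng.standard_normal((d_out, d_in))
# Calibration activations in which a few input channels are much larger,
# mirroring the outlier channels observed in LLMs.
channel_scale = np.where(rng.random(d_in) < 0.05, 6.0, 1.0)
X = channel_scale * rng.standard_normal((n_tokens, d_in))

# Wanda-style importance: |w_ij| * ||x_j||_2, with the activation norm taken
# over the calibration tokens of input channel j.
score = np.abs(W) * np.linalg.norm(X, axis=0)

def prune(metric, sparsity=0.5):
    # Per-output-row comparison group: keep the highest-scoring weights per row.
    k = int(metric.shape[1] * (1 - sparsity))
    mask = np.zeros_like(metric, dtype=bool)
    np.put_along_axis(mask, np.argsort(metric, axis=1)[:, -k:], True, axis=1)
    return W * mask

# Output reconstruction error at 50% sparsity under the two criteria.
err_wanda = np.linalg.norm(X @ prune(score).T - X @ W.T)
err_magnitude = np.linalg.norm(X @ prune(np.abs(W)).T - X @ W.T)
print(f"output error, Wanda-style: {err_wanda:.2f} vs magnitude: {err_magnitude:.2f}")
```

The activation-aware score protects weights attached to high-magnitude channels, which a pure magnitude criterion discards as readily as any other small weight.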

Table 2 Comparison of pruning algorithms for unstructured pruning in LLMs

Quantization and precision-driven compression

Quantization aims to reduce the precision of model parameters, thereby decreasing storage and computational complexity, significantly improving inference efficiency and hardware compatibility. Specifically, quantization converts floating-point values (e.g., FP32, BF16) into low-precision fixed-point, integer, or low-bit floating-point formats (e.g., INT8, FP4), effectively reducing the computational load and memory consumption during inference. Studies have shown that classical models such as AlexNet and ResNet, when quantized to INT8, can still achieve classification accuracy close to floating-point precision on the ImageNet dataset, demonstrating the effectiveness of quantization36.

Quantization fundamentals

Weight Quantization and Activation Quantization

Weight Quantization and Activation Quantization are two fundamental directions in quantization. Weight quantization converts neural network weights from high-precision floating-point numbers to lower-precision integers, reducing storage requirements and significantly lowering inference power consumption. Activation quantization further reduces memory usage and bandwidth requirements by quantizing intermediate activation values. The distribution of weights and activations plays a critical role in determining quantization precision. For instance, many neural networks exhibit normally distributed or sparse weights, enabling effective performance retention even after clipping outliers or redistributing value ranges37.

Symmetric and Asymmetric Quantization

In symmetric quantization, the quantization intervals for weights and activations are symmetric around zero, while asymmetric quantization allows non-symmetric intervals, which are more effective for complex data distributions. For example, the LSQ (Learned Step Size Quantization) method dynamically learns the quantization step size and adjusts strategies based on the actual distribution of weights and activations, thereby improving the adaptability of low-precision quantization38.
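The two schemes can be compared on skewed data with a short numpy sketch (generic 4-bit min/max quantizers for illustration, not LSQ, whose step sizes are learned during training):

```python
import numpy as np

rng = np.random.default_rng(0)

def quant_symmetric(x, bits=8):
    # Zero-centred grid: one scale, range [-max|x|, +max|x|].
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def quant_asymmetric(x, bits=8):
    # Affine grid with a zero-point: range [min(x), max(x)].
    qmax = 2 ** bits - 1
    scale = (x.max() - x.min()) / qmax
    zero_point = np.round(-x.min() / scale)
    q = np.clip(np.round(x / scale + zero_point), 0, qmax)
    return (q - zero_point) * scale

# Skewed activations (e.g. post-ReLU): mostly zero or positive.
x = np.maximum(rng.standard_normal(10_000), 0.0)

err_sym = np.mean((x - quant_symmetric(x, bits=4)) ** 2)
err_asym = np.mean((x - quant_asymmetric(x, bits=4)) ** 2)
print(f"MSE symmetric: {err_sym:.6f}, asymmetric: {err_asym:.6f}")
```

On this one-sided distribution the symmetric grid wastes half of its levels on negative values that never occur, so the asymmetric quantizer achieves a markedly lower error at the same bit-width.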

Precision restoration

Quantization-Aware Training

Quantization-Aware Training (QAT) is an optimization strategy that introduces simulated quantization noise during training to adapt models to quantization errors. Studies have shown that introducing quantization noise can act as a form of regularization, akin to data augmentation or Dropout, thereby enhancing model robustness39. For instance, simulating quantization errors during training significantly improves a model’s adaptability to low-precision computations in inference40. Additionally, HAQ (Hardware-Aware Automated Quantization) uses reinforcement learning to automatically determine the optimal quantization bit-width for each layer, balancing resource utilization and performance41.

Representative Post-Training Quantization Techniques (PTQ)

PTQ converts pretrained models to low-precision representations through calibration with minimal data, optimizing memory footprint and inference latency. For Transformer architectures, GPTQ18 pioneered layer-wise 3–4-bit quantization through a Hessian-based greedy algorithm that minimizes output reconstruction error. Its optimized implementation achieves full quantization of OPT-175B42 in 4.2 GPU hours with a minimal PPL performance loss (1–3%) after 4-bit quantization, enabling single-A800 deployment. Limitations include GPU dependency during quantization and framework-specific format constraints. AWQ16 offers an adaptive quantization approach that optimizes both weights and activations. By identifying critical weights through activation statistics, AWQ dynamically adjusts quantization granularity. While achieving superior accuracy over GPTQ at equivalent bit-widths, AWQ requires calibration datasets and incurs higher computational overhead. For CPU deployment, GGML introduced SIMD-accelerated low-bit arithmetic via AVX/NEON instructions, later superseded by GGUF’s unified format supporting multi-hardware execution (CUDA/AVX) and enhanced metadata capabilities. GGUF enables extreme compression (1–8-bit) with scalable storage, successfully reducing the 671B-parameter DeepSeek-R1 model43 below 140 GB through extreme 1-bit quantization.

Low-rank decomposition for model compression

Low-rank decomposition, as a model compression technique, aims to reduce model size by approximating weight matrices with lower-rank counterparts, leveraging the “algebraic redundancy” in models. Recent advancements in this field address various aspects of model redundancy and computational efficiency. ASVD44 addresses the issue of activation distribution variance by transforming the weight matrix based on the activation distribution, thereby allowing outliers in the activation matrix to be absorbed into the transformed weight matrix and improving decomposition accuracy. This method also incorporates an iterative calibration process to optimize layer-specific decomposition, accounting for the varying sensitivity of different LLM layers. LoSparse45 introduces a novel approach that approximates a weight matrix as the sum of a low-rank matrix and a sparse matrix. This combines the benefits of both low-rank approximations and pruning, overcoming their individual limitations: low-rank methods can ignore the diversity of neurons, and pruning can remove important neurons under high compression rates. Lillama46, on the other hand, observes that while pre-trained Transformer weights are often not inherently low-rank, their activations exhibit low-rank characteristics. It proposes a compression method that locally distills activations with low-rank weights, using SVD for initialization and a joint loss that combines teacher and student activations to accelerate convergence and reduce distillation loss. MoDeGPT47 takes a modular decomposition approach, categorizing Transformer layer weight matrices into three functional modules based on their nonlinearity levels and applying specific matrix decomposition algorithms (Nyström approximation, CR decomposition, and SVD) to each module to ensure bounded errors. This method reduces hidden dimensions through output reconstruction at a larger structural scale, offering a systematic framework for compression. 
Similarly, SVD-LLM-V29, building on SVD-LLM, addresses weight redundancy heterogeneity by assigning unique compression ratios to each weight matrix based on its theoretical truncation loss. It also refines the weight truncation process by replacing the traditional Cholesky decomposition with two rounds of SVD, ensuring lower and more stable truncation loss in practice, and thereby optimizing the loss in the weight truncation phase.

When compression becomes catastrophic

As compression techniques push LLMs toward their limits, a striking pattern emerges: performance remains remarkably stable through initial compression, only to collapse abruptly once a critical compression threshold is crossed.

Defining model phase transition

Model phase transition refers to the phenomenon observed during the compression and optimization of LLMs, such as pruning and quantization, where the model shifts abruptly from a phase of gradual performance degradation to a phase of rapid and catastrophic collapse. This phase transition occurs in two distinct stages: (1) in the early stages of compression, performance degradation is gradual and controlled, and the model maintains most of its task effectiveness and robustness; (2) as compression intensifies, the model reaches a critical threshold, the Phase Transition Point, beyond which its performance drops sharply, losing both expressive capacity and task adaptability.

We formally define the operational regime prior to this critical threshold as “near-lossless” compression. Functionally, this implies that the degradation in average downstream task metrics remains within an acceptable tolerance (≤5%), ensuring the model’s utility is largely preserved despite parameter reduction. A more direct statistical observable for this stability is WikiText-2 perplexity (PPL), where the allowable variation is ΔPPL ≈ 1.5 relative to the dense baseline. For instance, empirical data show that LLaMA2-7B maintains stability as its PPL shifts from ~5.5 (dense) to ~7.0 (at the PTP), and similarly, Qwen2.5-7B transitions from ~7.9 to ~9.2.
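For reference, PPL is simply the exponential of the mean per-token negative log-likelihood. The sketch below (with constant per-token losses chosen to reproduce the illustrative 5.5 to 7.0 shift mentioned above) shows how the ΔPPL tolerance is read off:

```python
import numpy as np

def perplexity(token_nlls):
    # WikiText-2-style PPL: exp of the mean per-token negative log-likelihood
    # (natural log) over the evaluation corpus.
    return float(np.exp(np.mean(token_nlls)))

# Illustrative numbers only: a dense model vs a compressed one near the PTP.
dense_nll = np.full(1000, np.log(5.5))       # corresponds to PPL = 5.5
compressed_nll = np.full(1000, np.log(7.0))  # corresponds to PPL = 7.0

delta = perplexity(compressed_nll) - perplexity(dense_nll)
print(f"dPPL = {delta:.2f}")
```

In practice the per-token losses come from running the model over the held-out corpus; the compression is deemed near-lossless while this delta stays within the stated tolerance.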

This phenomenon is commonly seen across various compression techniques. For example, structured pruning beyond 50% sparsity or unstructured pruning exceeding 70% often leads to sudden model collapse. Similarly, quantization below 3-bit precision typically results in a sharp decline in task performance.

Quantitative phase transition modeling

To characterize the model phase transition phenomenon across compression methods, we introduce an enhanced piecewise function L(s) modeling performance against compression ratio s. This formulation captures both the gradual degradation and catastrophic collapse phases through distinct mathematical regimes, with continuity enforced at the phase transition point s0:

$$L(s)=\left\{\begin{array}{ll}A\cdot {s}^{\alpha }+B, & s\le {s}_{0}\\ A\cdot {s}_{0}^{\alpha }\cdot \exp \left(\beta (s-{s}_{0})+\gamma {(s-{s}_{0})}^{2}\right)+B, & s > {s}_{0}\end{array}\right.$$
(5)

where s represents the compression ratio (sparsity or quantization precision), s0 denotes the phase transition point, A and α are power-law parameters governing gradual degradation with B as the performance baseline, and β and γ control the exponential collapse dynamics beyond s0. Because the baseline term B is shared by both branches, the formulation enforces C0 continuity at s0, with \(L({s}_{0})=A\cdot {s}_{0}^{\alpha }+B\). The quadratic term \(\gamma {(s-{s}_{0})}^{2}\) enables precise fitting of the accelerated collapse rates observed beyond s0, addressing a limitation of pure exponential decay models; the resulting form accurately fits empirical data from thirty compression methods while providing interpretable parameters for phase transition analysis.
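The piecewise model can be written down directly (a numpy sketch with illustrative parameter values, not values fitted to the paper's data; the baseline B is added to both branches so the curve is continuous at s0, as the shared-baseline continuity argument requires):

```python
import numpy as np

def L(s, A=2.0, alpha=1.5, B=5.5, s0=0.6, beta=8.0, gamma=25.0):
    # Piecewise power-law / exponential model of the phase transition.
    # The +B offset appears in both branches, so L is continuous at s0.
    s = np.asarray(s, dtype=float)
    pre = A * s ** alpha + B
    post = A * s0 ** alpha * np.exp(beta * (s - s0) + gamma * (s - s0) ** 2) + B
    return np.where(s <= s0, pre, post)

# C0 continuity at the phase transition point:
eps = 1e-9
gap = abs(float(L(0.6 - eps)) - float(L(0.6 + eps)))

# Gradual power-law degradation before s0, catastrophic blow-up after it:
print(L([0.2, 0.4, 0.6]))    # slow rise
print(L([0.65, 0.7, 0.75]))  # accelerating collapse
```

Fitting A, α, B, s0, β, and γ to a measured performance-versus-sparsity curve (for example by nonlinear least squares) then localizes the PTP as the estimated s0, which is how the turning points marked in the figures are obtained.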

PTP in structured pruning

To systematically characterize phase transitions in structured pruning, we reproduced several representative methods using LLaMA2-7B as the unified testbed for cross-method compatibility. Performance was evaluated via perplexity on WikiText-2—a standard language modeling benchmark that faithfully reflects degradation in linguistic structure mastery while ensuring alignment with established research protocols (lower PPL indicates superior performance). Figure 2 compares PPL evolution across sparsity levels, revealing critical trade-offs between compression-induced acceleration and accuracy preservation.

Fig. 2: Structured pruning phase transition.

This figure presents the perplexity (PPL) of several structured pruning methods across different sparsity ratios, including both experimental data and fitted curves. The stars indicate the turning points of the piecewise fitting curves, where the x-coordinate corresponds to the model’s phase transition point.

Our experiments demonstrate a consistent phase transition threshold at 30–45% sparsity (Fig. 2). Beyond this inflection point, further compression triggers catastrophic performance collapse, manifested as accelerated PPL degradation. Crucially, structured pruning exhibits significantly lower PTPs than unstructured approaches (detailed in Supplementary Information “Performance and Robustness Under Model Phase Transition”), with most methods tolerating <40% sparsity before collapse. This reduced resilience aligns with structured pruning’s fundamental mechanism: whereas unstructured pruning preserves critical weights through granular removal, structured methods discard entire architectural components (such as attention heads or layers), eliminating vital parameters prematurely. Consequently, performance degradation follows a shallower initial trajectory but reaches collapse thresholds at substantially lower compression intensities.

PTP in unstructured pruning

Recent advancements in unstructured pruning have yielded substantial progress over the past two years. Our systematic evaluation encompasses over a dozen prominent methods applied to the widely supported LLaMA-2-7B model, with perplexity serving as the primary metric for visualizing performance evolution during compression. Similar to structured pruning, the performance-compression curves reveal a definitive model phase transition. Crucially, unstructured pruning exhibits significantly higher PTPs distributed between 0.55–0.65 sparsity (Fig. 3), demonstrating superior compression resilience before collapse compared to structured approaches. This elevated threshold indicates that unstructured pruning can sustain higher compression ratios while maintaining functional integrity.

Fig. 3: Unstructured pruning phase transition.

This figure presents the perplexity (PPL) of several unstructured pruning methods across different sparsity ratios, including both experimental data and fitted curves. The stars indicate the turning points of the piecewise fitting curves, where the x-coordinate corresponds to the model’s phase transition point.

Notably, contemporary research frequently emphasizes performance comparisons at extreme compression rates (70% sparsity), positioning this as a primary differentiator. Our experimental evidence challenges this practice: method divergence remains minimal near the PTP (0.55–0.65), while models subjected to 70% sparsity exhibit complete phase transition collapse, rendering them practically unusable. This finding reveals fundamental limitations in the prevailing research paradigm centered on SparseGPT and Wanda derivatives, indicating that current optimization approaches share identical failure modes and require paradigm-shifting innovations to address the core collapse mechanism.

PTP in quantization

In order to systematically evaluate the impact of model quantization on inference performance, we conducted comprehensive experiments on multiple models quantized via the GGUF framework. These experiments covered progressive quantization from 1-bit to 16-bit precision, focusing on several widely adopted LLM families, including LLaMA-248, Qwen-2.549, and Gemma-350, which exhibit strong performance while covering a diverse range of model scales.

First, we used the WikiText-2 dataset to measure both the perplexity degradation and token generation speed for each model under varying quantization bitwidths and strategies. Our results provide a clear illustration of how quantization levels affect model performance (Table S5). Next, we selected the ARC51 and MMLU52 datasets to evaluate the model’s general knowledge and question-answering capabilities. These datasets allow us to observe the impact of progressive quantization on the accuracy of the model across various sizes. We specifically focused on how the model’s performance evolved during the full-scale quantization process (Fig. 4).

Fig. 4: Quantized model performance.

This figure shows the relationship between parameter size (GB) and perplexity (PPL) on WikiText-2 across various quantized large language models (Qwen2.5, LLaMA-2, Gemma-3). Each curve represents a different model family with multiple quantization levels from 2-bit to 16-bit. While performance degradation is smooth at higher precisions, all models exhibit a sharp perplexity spike at 2-bit quantization, identifying a consistent phase transition point where compression becomes catastrophic. Larger models (70B) demonstrate delayed collapse, indicating greater robustness due to scale.

Phase transition point

A consistent phase transition emerges at 3-bit quantization across all model families. Below this threshold, models exhibit catastrophic nonlinear collapse in WikiText-2 perplexity, with Qwen models showing ≤7% degradation at Q3_K_M versus 13–45% at Q2_K. This pattern is reinforced by knowledge-task performance: Qwen2.5-14B suffers roughly 3× greater accuracy loss on the MMLU/ARC benchmarks at 2-bit quantization. Identical transitions occur in the LLaMA-2 and Gemma families, confirming 3-bit as the universal stability boundary.

Model-scaling effects

Larger models demonstrate significantly higher phase transition resilience. At 2-bit quantization, 70B-class models preserve 94% baseline PPL and 90% MMLU accuracy (Qwen2.5-72B), while sub-10B models suffer at least 30% PPL degradation and 25% MMLU accuracy loss. The delayed collapse in massive models indicates size-dependent redundancy buffers against information loss.

Compression efficiency

Within the stable phase (3-bit and above), quantization achieves 4–5× model compression while preserving 90% baseline performance across all tasks. Below 3-bit, though compression ratios reach 6–8×, catastrophic collapse in both language modeling (PPL) and knowledge tasks (MMLU/ARC) renders models operationally unusable.

PTP in low-rank decomposition

In the domain of LLMs, low-rank decomposition methods inherently offer limited compression ratios compared to pruning or quantization, as weight matrices in contemporary LLMs often exhibit near-full-rank characteristics. To systematically characterize the phase transition behavior in this algebraic dimension, we evaluated five representative low-rank decomposition methods44,53,54,55,56 on the LLaMA2-7B model. We applied the same piecewise power-law-exponential fitting methodology to pinpoint their critical thresholds.
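The threshold-finding step can be approximated with a simple breakpoint search: fit a power law (linear in log–log coordinates) below each candidate threshold and an exponential (log-linear) above it, and keep the candidate with the lowest total residual. The sketch below runs on synthetic data with a planted breakpoint; it illustrates the idea only and is not the fitter used in our experiments:

```python
import math

def fit_ptp(xs, ys, grid):
    """Grid-search the breakpoint x_c of a piecewise power-law-exponential
    curve: least-squares fit log(y) linearly in log(1+x) below x_c and
    linearly in x above it; return the x_c with the smallest total error."""
    def sse_linear(us, vs):
        n = len(us)
        mu, mv = sum(us) / n, sum(vs) / n
        den = sum((u - mu) ** 2 for u in us) or 1e-12
        slope = sum((u - mu) * (v - mv) for u, v in zip(us, vs)) / den
        icept = mv - slope * mu
        return sum((icept + slope * u - v) ** 2 for u, v in zip(us, vs))

    best_xc, best_err = None, float("inf")
    for xc in grid:
        lo = [(math.log1p(x), math.log(y)) for x, y in zip(xs, ys) if x <= xc]
        hi = [(x, math.log(y)) for x, y in zip(xs, ys) if x > xc]
        if len(lo) < 2 or len(hi) < 2:
            continue
        err = sse_linear(*zip(*lo)) + sse_linear(*zip(*hi))
        if err < best_err:
            best_xc, best_err = xc, err
    return best_xc

# Synthetic PPL curve: power law up to a planted breakpoint at 0.55
# sparsity, then a sharp exponential rise (illustrative numbers only).
xs = [i / 20 for i in range(1, 16)]
ys = [5.0 * (1 + x) ** 1.5 if x <= 0.55
      else 8.0 * math.exp(12 * (x - 0.55)) for x in xs]
print(fit_ptp(xs, ys, grid=[i / 20 for i in range(2, 15)]))  # → 0.55
```

In practice the fit is performed on measured perplexities at each compression ratio, and the recovered breakpoint is the PTP marked by the stars in Fig. 5.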

As illustrated in Fig. 5, the performance trajectories reveal a bifurcation into two distinct phase transition regimes, differentiated by their decomposition objectives. Mode I (Weight-Dominant): Approaches prioritizing static weight reconstruction, exemplified by SFSD and ASVD, encounter premature capability collapse, with PTPs confined to the low range of 16.3%–18.7%. This empirically validates that the intrinsic algebraic redundancy of static weight matrices is critically low, limiting the effectiveness of direct spectral truncation. Mode II (Activation-Centric): Conversely, strategies that leverage the low-rank geometry of the activation space (FLAT-LLM, SoLA) or incorporate truncation-aware compensation (SVD-LLM) demonstrate significantly enhanced robustness. These methods extend the stability frontier to 28.0%–40.0% sparsity, with FLAT-LLM achieving the upper bound. This divergence underscores that while weight matrices approximate full rank, the feature manifold remains highly compressible. Below these thresholds, perplexity degradation is manageable; however, crossing them triggers an immediate and sharp exponential rise in PPL.

Fig. 5: Low-rank decomposition phase transition.

This figure presents the perplexity (PPL) of several low-rank decomposition methods (ASVD, SVD-LLM, SFSD, SoLA, FLAT-LLM) across different sparsity ratios. The scatter points represent experimental data, and the curves show the fitted piecewise power-law-exponential models. The stars indicate the phase transition points (PTPs), marking the critical sparsity threshold beyond which performance degrades sharply.

Phase transitions in combined model compression

Combined model compression refers to the application of multiple compression strategies to aggressively reduce the size of a model, achieving higher compression rates. Mainstream model compression techniques are broadly classified into three categories. The first removes unimportant parameters or neurons, as represented by structured and unstructured pruning. The second reduces the bit-width or precision of existing parameters, with model quantization being the primary approach. The third employs matrix decomposition to reduce the number of parameters, exemplified by low-rank factorization. These three techniques target different forms of model redundancy (structural, numerical, and algebraic) that often coexist in deep neural networks. In other words, a model can simultaneously be sparse, represented by fewer bits, and approximated in a lower-rank subspace. Thus, for large models, combined compression can be viewed as imposing “information bottlenecks” at multiple levels, forcing the model to retain only the most crucial information. This approach can theoretically achieve higher overall compression before reaching the PTP, where the model’s performance begins to degrade rapidly.

We combined several well-performing compression methods from different categories and analyzed their effects on the LLaMA2-7B model. Figure 6 shows the phase transition curve under the synergistic application of Wanda pruning27 and GGUF quantization. The left plot displays the PPL surface for the combined compression, while the right plot shows the contour plot of the same surface. From the left plot, it is evident that pruning has the more significant impact on model performance. Additionally, combining pruning and quantization does not substantially shift the phase transition point of either individual method (the critical thresholds remain around 55% sparsity for pruning and 2-bit precision for quantization). The orange star-shaped curve on the right plot represents the model’s phase transition curve, while the red line represents the “cost-effective” curve, showing the lowest PPL for a given compression/memory budget. The intersection of these two curves marks the model’s compression limit, considering the loss of accuracy. For the combined Wanda and GGUF approach, this limit is a retention rate of ~11%, meaning the model can be compressed to roughly one-tenth of its original size without significant performance degradation. Other combinations, such as SparseGPT coupled with GPTQ18,29, achieved an extreme retention rate of 12% (60% pruning sparsity coupled with INT4 quantization, PPL = 8.4), while ADMM integrated with GGUF reached a retention rate of 9% (60% pruning sparsity coupled with 3-bit quantization, PPL = 7.06)31.

Fig. 6: Combined pruning and quantization.

a 3D surface plot of perplexity (PPL) for LLaMA2-7b under combined GGUF quantization and Wanda pruning, illustrating how PPL varies with different compression settings. b 2D contour projection of the same surface, with the red line marking the most cost-effective compression path (minimal PPL at equivalent compression ratios) and the orange curve showing the phase transition line (PTL), beyond which model performance rapidly deteriorates.

Beyond the remarkable performance of pruning-quantization hybrids, we further explored the interaction between algebraic and numerical redundancies. As shown in Fig. 7, the experimental results reveal a clear stability hierarchy, indicating that compression methods with deeper PTPs (higher robustness) naturally dominate the effective compression space. Specifically, quantization acts as the primary driver of compression due to its superior robustness. Our analysis suggests a hierarchical intervention logic: Quantization is prioritized initially. Only when the quantization compression rate reaches approximately 60% (6-bit) should unstructured pruning methods be introduced to further reduce model size. Furthermore, low-rank decomposition and aggressive pruning ratios should only be considered when quantization approaches its critical PTP limit (76% compression at 3-bit). This sequential activation, exhausting the safe zone of the most robust method before engaging the next, maximizes the compression ratio while maintaining functional integrity, providing the empirical basis for the systematic framework introduced in the next section. These combined compression ceilings align closely with the phase transition thresholds of their constituent methods. For integration with pruning, LoSparse successfully incorporates CoFi’s structured pruning framework, mitigating limitations inherent to standalone approaches57.

Fig. 7: Combined low-rank decomposition and quantization.

a 3D surface plot of perplexity (PPL) for LLaMA2-7B under joint ASVD decomposition and GGUF quantization. b 2D contour projection of the same surface. The red line indicates the optimal compression trajectory, while the orange curve marks the Phase Transition Line (PTL). The distinct orthogonal cliffs along both axes illustrate the independence of their respective Phase Transition Points (PTPs).

In conclusion, by observing the phase transition points of individual compression techniques, we can quickly deduce the theoretical limit of combined compression before the model undergoes catastrophic degradation. This insight allows for optimizing deployment strategies and minimizing model size while maintaining sufficient performance. For instance, a 16 GB memory budget that previously accommodated only the original LLaMA2-7B can now host an extremely compressed version of LLaMA2-70B.

Criticality-aware compression framework

We propose a criticality-aware compression framework to address the limitations of ad-hoc compression combinations and provide rigorous guidelines for deployment. This framework fundamentally reframes model compression from an empirical trial-and-error process into a structured trajectory planning problem within a multi-dimensional phase space. By characterizing the critical stability boundaries of the model, we employ a phase avoidance strategy to identify the minimum energy path for optimal compression.

Theoretical orthogonality

The feasibility of our framework is grounded in the orthogonality of compression mechanisms (detailed in Supplementary Information “Orthogonality of Compression Mechanisms”). Since pruning, quantization, and low-rank decomposition target disjoint redundancy subspaces (Spatial, Numerical, and Algebraic), their induced errors are statistically additive rather than multiplicative. This orthogonality implies that applying one method does not significantly shift the Phase Transition Points (PTPs) of others. Consequently, the phase space of a model can be defined as a hyper-rectangle bounded by the individual PTPs of each method. Within this bounded region, the interaction effects are minimal, allowing for predictable performance behavior.
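Under this orthogonality assumption, membership in the safe phase space is simply a conjunction of per-method threshold tests. A minimal sketch, using PTP values in line with those reported above (these are illustrative, model-dependent defaults, not universal constants; the 3-bit boundary itself is treated as admissible, as in our final configuration):

```python
# Hypothetical PTP values consistent with the experiments reported above.
PTP = {"sparsity": 0.55, "bits": 3, "rank_reduction": 0.30}

def in_safe_region(sparsity, bits, rank_reduction, ptp=PTP):
    """Membership test for the safe hyper-rectangle: stay below the
    pruning and decomposition thresholds, and at or above the 3-bit
    quantization boundary identified as the stability limit."""
    return (sparsity < ptp["sparsity"]
            and bits >= ptp["bits"]
            and rank_reduction < ptp["rank_reduction"])

print(in_safe_region(0.35, 3, 0.05))  # → True  (the LLaMA2-7B-PTP point)
print(in_safe_region(0.35, 2, 0.05))  # → False (2-bit crosses the quant PTP)
```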

Phase avoidance strategy via trajectory planning

We formalize the phase avoidance strategy not merely as a heuristic, but as a constrained trajectory optimization problem on the model’s potential energy surface. Here, we define the model’s perplexity \({\mathcal{L}}({\mathcal{C}})\) as the potential energy of the system state \({\mathcal{C}}\).

The geometry of degradation

The multi-dimensional phase space is topologically partitioned into two distinct regions by the critical thresholds of each method. The first is the region of graceful degradation (\({{\mathcal{S}}}_{safe}\)), defined as the subspace where the loss function \({\mathcal{L}}\) exhibits convex or linear behavior with respect to the compression ratio. Mathematically, this corresponds to the regime where the perturbation δ introduced by compression satisfies \({\mathcal{L}}(\theta +\delta )\approx {\mathcal{L}}(\theta )+\nabla {{\mathcal{L}}}^{T}\delta\), and higher-order derivatives are negligible. In this region, capability loss is predictable and recoverable. The boundary of this region is the event horizon (\(\partial {\mathcal{S}}\)), formed by the union of individual phase transition points (PTPs): \(\partial {\mathcal{S}}=\{{\mathcal{C}}\,|\,s=PT{P}_{prune}\,\vee\,b=PT{P}_{quant}\,\vee\,r=PT{P}_{rank}\}\). Crossing this boundary drives the system into a chaotic regime where the Hessian spectrum of the loss function undergoes catastrophic changes, leading to exponential performance collapse.

Minimum energy path optimization

The goal of the phase avoidance strategy is to navigate from the dense state to a target compressed state along a minimum energy path. Unlike standard optimization, which seeks a local minimum, this process seeks a trajectory \({\mathcal{T}}\) that maximizes compression while keeping the system’s potential energy (PPL) minimal and strictly within \({{\mathcal{S}}}_{safe}\).

Let \({\mathcal{C}}=(s,b,r)\) be the configuration state vector representing sparsity, bit-width, and rank reduction. The compression problem is formulated as finding the optimal configuration \({{\mathcal{C}}}^{* }\) that minimizes model size subject to stability constraints:

$$\begin{array}{l}\mathop{\min }\limits_{{\mathcal{C}}}\,Size({\mathcal{C}})\\ \,{\rm{s.t.}}\,\,{\mathcal{L}}({\mathcal{C}})-{{\mathcal{L}}}_{base}\le \epsilon \,\,(\text{Near-Lossless Constraint})\\ {\mathcal{C}}\in {{\mathcal{S}}}_{safe}\,\,\iff \,\,\{s < PT{P}_{prune},\,b > PT{P}_{quant},\,r < PT{P}_{rank}\}\end{array}$$
(6)

By treating the PTPs as hard constraints (the event horizon), the solver is forced to exploit the redundancy dimensions with the shallowest energy gradients (highest robustness) first, naturally deriving the sequential activation strategy described in Fig. 8.
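Because the feasible set in Eq. (6) is a hyper-rectangle, even a coarse grid search solves it. The sketch below assumes a toy linear size model and a hypothetical additive perplexity surface (`toy_ppl`, `eps`, and the PTP values are illustrative stand-ins, not measured quantities; the 3-bit boundary is treated as admissible):

```python
from itertools import product

def model_size_gb(s, b, r, dense_gb=13.5):
    """Crude size model (an assumption, not the paper's): bit-width scales
    storage linearly; pruning and rank reduction remove parameters."""
    return dense_gb * (b / 16.0) * (1.0 - s) * (1.0 - r)

def plan(ppl, ppl_base=5.47, eps=1.6, ptp=(0.55, 3, 0.30)):
    """Coarse grid search for Eq. (6): minimize size over configurations
    that satisfy the hard PTP constraints and the near-lossless budget."""
    s_ptp, b_ptp, r_ptp = ptp
    best = None
    for s, b, r in product([0.0, 0.2, 0.35, 0.5],
                           [16, 8, 4, 3],
                           [0.0, 0.05, 0.2]):
        if not (s < s_ptp and b >= b_ptp and r < r_ptp):
            continue  # outside the safe hyper-rectangle
        if ppl(s, b, r) - ppl_base > eps:
            continue  # near-lossless constraint violated
        size = model_size_gb(s, b, r)
        if best is None or size < best[0]:
            best = (size, (s, b, r))
    return best

# Hypothetical additive perplexity surface (illustrative numbers only).
toy_ppl = lambda s, b, r: 5.47 + 2.0 * s + 0.05 * (16 - b) + 3.0 * r
size, config = plan(toy_ppl)
print(config)  # → (0.35, 3, 0.05)
```

Even this toy surface pushes the solver toward the robust quantization axis first, mirroring the sequential activation strategy described in the text.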

Fig. 8: Phase avoidance strategy in multi-dimensional compression space.

This figure illustrates the minimum energy trajectory for compressing LLaMA2-7B, guided by Perplexity (PPL) as the potential energy function. The 3D space is defined by three orthogonal compression dimensions: Quantization (GGUF), Unstructured Pruning (ADMM), and Low-Rank Decomposition (ASVD). The black solid line represents the optimal compression path, which navigates through the safe zone bounded by the phase transition points (PTPs) of each method. The projections on the three planes visualize the pairwise trade-offs. The star marker denotes the final compressed state of our LLaMA2-7B-PTP model (Size: 1.89 GB), achieving a compound compression configuration of 76% Quantization (3-bit), 35% Pruning, and 5% Decomposition, strictly avoiding the collapse regions (red zones).

Heuristic guidelines for compression

Based on our extensive empirical analysis of PTPs and the trajectory shown in Fig. 8, we synthesize the heuristic guidelines for optimal compression planning.

Priority by robustness

Our analysis reveals a clear hierarchy in redundancy robustness: Numerical > Structural > Algebraic. Numerical redundancy, exploited by quantization, exhibits the deepest PTP, remaining robust down to 3-bit precision. Structural redundancy (pruning) follows, tolerating up to ~50% sparsity. Algebraic redundancy (decomposition) is the least robust, with PTPs often occurring at ~20–30% removal. Consequently, quantization should serve as the primary driver of compression.

Sequential activation

The optimal trajectory suggests a sequential activation strategy aligned with the robustness hierarchy. Quantization is prioritized initially to reduce model size rapidly. Unstructured pruning is introduced only when quantization reaches saturation (a compression rate of ~60%). Low-rank decomposition acts as the final lever, activated only as quantization approaches its critical PTP limit (76% compression at Q3_K_M). This staged approach ensures that the most stable redundancy sources are exhausted before engaging more sensitive ones.

Engineering execution order

Crucially, we distinguish between planning priority and execution sequence. While quantization takes precedence in budget allocation, it must occur last in the actual deployment pipeline (i.e., Decomposition → Pruning → Quantization). Quantization is an irreversible operation that introduces noise and discretizes the optimization landscape; therefore, structural changes (pruning and decomposition) must be performed on high-precision weights first to ensure the accuracy of importance calculations, with quantization applied effectively as a final encapsulation step.
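The distinction between planning priority and execution order can be sketched as a pipeline in which each stage is a stand-in stub for a real method (ASVD, ADMM, GGUF); only the ordering matters here, with quantization applied last:

```python
def decompose(w, rank_reduction):
    # stub for ASVD-style decomposition: drop a fraction of components
    keep = int(len(w) * (1.0 - rank_reduction))
    return w[:keep]

def prune(w, sparsity):
    # stub for ADMM-style unstructured pruning: zero the smallest weights
    k = int(len(w) * sparsity)
    cutoff = sorted(abs(x) for x in w)[k] if k else 0.0
    return [x if abs(x) >= cutoff else 0.0 for x in w]

def quantize(w, bits):
    # stub for GGUF-style quantization: symmetric uniform rounding
    levels = 2 ** (bits - 1) - 1
    scale = max((abs(x) for x in w), default=1.0) or 1.0
    return [round(x / scale * levels) / levels * scale for x in w]

def compress(weights, plan):
    """Execution order: Decomposition -> Pruning -> Quantization. The
    structural steps operate on high-precision weights; quantization is
    the final, irreversible encapsulation step."""
    w = decompose(weights, plan["r"])
    w = prune(w, plan["s"])
    return quantize(w, plan["b"])

out = compress([0.9, -0.4, 0.1, 0.05] * 5,
               {"r": 0.05, "s": 0.35, "b": 3})
```

Reversing this order would let quantization noise corrupt the importance scores that pruning and decomposition depend on, which is why quantization must come last despite being planned first.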

Validation and perspectives

To validate the Phase Avoidance Strategy, we conducted a comparative analysis focusing on the “Compress Big” versus “Native Small” hypothesis. We compared the LLaMA-2-7B model compressed using our PTP-guided framework against a natively trained small model (LLaMA-3.2-1B) and a compressed newer-generation model of similar size (LLaMA-3.1-8B).

Visualization of the optimal trajectory

Figure 8 visualizes the actual compression path taken for the LLaMA-2-7B experiment. The trajectory strictly adheres to the safe zones defined by the PTPs of Quantization (GGUF), Pruning (ADMM), and Decomposition (ASVD). The final operating point, marked by the star, corresponds to the LLaMA2-7B-PTP model in Table 3. This point represents a sophisticated equilibrium:

Table 3 Comprehensive comparison of criticality-aware compression vs. OOPTP baselines and native small models

Quantization: 3-bit (Q3_K_M), contributing ~76% compression.

Pruning: 35% unstructured sparsity, further reducing redundancy without breaking structural integrity.

Decomposition: 5% rank reduction, shaving off the final algebraic redundancy.

This compound configuration yields an 85% total compression rate (1.89 GB final size) while maintaining a PPL of 6.92, demonstrating the efficacy of avoiding single-dimension collapse.
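Under the orthogonality assumption, the three retention ratios multiply, which is how the reported total compression rate arises from the stated per-method rates:

```python
# Retained fraction under the compound configuration: the three methods
# act on orthogonal redundancy dimensions, so their retentions multiply.
quant_retain = 1 - 0.76   # 3-bit (Q3_K_M): ~76% compression
prune_retain = 1 - 0.35   # 35% unstructured sparsity
rank_retain  = 1 - 0.05   # 5% rank reduction

retained = quant_retain * prune_retain * rank_retain  # ≈ 0.148
print(f"total compression ≈ {1 - retained:.0%}")      # → total compression ≈ 85%
```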

Evaluation benchmarks

We employed a diverse set of benchmarks to rigorously assess model capabilities across language modeling, reasoning, and generation quality. Perplexity (PPL) on WikiText-2 served as the primary indicator of language modeling stability. For reasoning and knowledge, we utilized ARC (Challenge and Easy)51, PIQA58, Winogrande59, HellaSwag60, and BoolQ61 to evaluate common-sense reasoning and factual accuracy. Generation Quality was assessed using ROUGE-1/2/L62 scores on the CNN/DailyMail63 dataset to measure information overlap and fluency, alongside BERTScore64 to evaluate semantic coherence. This comprehensive suite ensures that our compression strategy preserves not just statistical patterns but also the emergent cognitive abilities of LLMs.
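For reference, ROUGE-1 F1, the simplest of the ROUGE variants listed above, is just unigram-overlap precision and recall combined into an F-score; a minimal sketch (our evaluations used standard tooling, not this toy implementation):

```python
from collections import Counter

def rouge1_f(candidate, reference):
    """ROUGE-1 F1: harmonic mean of unigram-overlap precision and recall
    between a generated summary and a reference."""
    c = Counter(candidate.lower().split())
    r = Counter(reference.lower().split())
    overlap = sum((c & r).values())  # clipped unigram matches
    if not overlap:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f("the model was compressed",
                 "the model was heavily compressed")
```

ROUGE-2 and ROUGE-L extend the same idea to bigrams and longest common subsequences, while BERTScore replaces exact token matching with embedding similarity.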

PTP-guided compression performance

Table 3 presents the comprehensive results, from which we highlight three key observations regarding the efficacy of our framework.

Superiority of phase avoidance (Group 1)

The LLaMA2-7B-chat-PTP model (Combined Compression) achieves a compact size of 1.89 GB (85% compression ratio) by strictly adhering to the safe zones of all three methods. Despite being smaller than the single-method aggressive baselines, such as Q2_K at 2.36 GB or ADMM at 4.39 GB, it maintains a WikiText-2 PPL of 6.91. This performance significantly outstrips the collapsed “Out Of Phase Transition Point” (OOPTP) baselines, where ADMM degrades to a PPL of 9.54 and ASVD to 8.59. These results confirm that avoiding the phase transition in multiple dimensions yields superior retention of model capabilities compared to pushing a single dimension to its breaking point.

The “Compress Big” advantage (Group 2)

A critical finding emerges from the comparison between the compressed LLaMA2-7B-chat-PTP (1.89 GB) and the natively trained LLaMA-3.2-1B-Instruct (2.3 GB). Despite being 18% smaller in storage, the compressed 7B model significantly outperforms the native 1B model across almost all benchmarks. For instance, it achieves an ARC-C score of 55.0 compared to 45.0 for the native model, and a BERTScore of 87.06 versus 73.21. This challenges the prevailing industry trend of training small models from scratch, suggesting that compressing larger models allows for the retention of complex “world model” features that small models simply never acquire during pre-training.

Generational robustness (Group 3)

Applying our framework to the newer LLaMA-3.1-8B, we successfully compressed it to 2.4 GB, matching the size of the 1B model. This compressed model achieves state-of-the-art performance for its size class, with an ARC-C score of 72.0 and strong MMLU-implied capabilities. This result further validates the universality of the Phase Avoidance Strategy, demonstrating its applicability and robustness across different model generations.

Perspectives: rethinking efficient AI

Based on the Criticality-Aware Framework and our experimental results, we propose four perspectives to guide future efficient AI development:

(1) The Illusion of Scale: Existing LLM architectures exhibit an illusion where parameter count is conflated with capability. Our results show that at least 90% of the parameters in current dense models (like LLaMA-2 and LLaMA-3) are redundant for inference, serving primarily as a scratchpad for optimization during training.

(2) Superiority of Compression over Ab Initio Training: We advocate for a paradigm shift from training small models from scratch to compressing large pre-trained models. Large-scale architectures possess the capacity to capture complex, high-dimensional feature representations during pre-training that smaller architectures inherently fail to acquire. Our framework demonstrates that PTP-guided compression preserves these sophisticated representations within a reduced memory footprint, yielding reasoning capabilities significantly superior to those of native models of comparable size.

(3) Maximizing Information Density for Edge Deployment: In resource-constrained environments (such as edge devices), the industry intuition is often to select a native small model. We argue this is suboptimal. The information density of a compressed large model far exceeds that of a native small model. The golden rule for deployment should be: always train the largest possible model, then compress it to the target budget using Phase Avoidance.

(4) The Event Horizon of Capability: The catastrophic failure of OOPTP models (detailed in Table 3) illustrates that the Phase Transition Point is not merely a performance dip, but an event horizon of model capability. Beyond this critical threshold, the model does not just get weaker; it undergoes a qualitative collapse, losing the emergent abilities that define LLMs. Respecting this horizon is the fundamental constraint of efficient AI.

Conclusions and outlook

This paper systematically revisits the phenomenon of model phase transition, where LLMs transition from controlled performance degradation to catastrophic collapse under progressive compression. Our Perspective integrates theoretical insights, experimental findings, and future research directions. Below, we summarize key discoveries and outline promising research trajectories.

The fundamental limits of compression

Model phase transition theory reveals that compression boundaries are fundamentally governed by critical phase transition points. Our comprehensive analysis establishes distinct PTP distributions across compression paradigms: pruning (65% sparsity for unstructured, 45% for structured), quantization (3-bit precision, equivalent to 77% compression), and low-rank decomposition (30% sparsity). These thresholds originate from three orthogonal redundancy mechanisms—structural, numerical, and algebraic—that collectively constitute the theoretical foundation of phase transitions. Crucially, the orthogonal nature of these redundancies ensures PTP stability under combined compression strategies.

Theoretical framework and deployment implications

Our piecewise power-law-exponential formulation quantitatively models performance-compression curves across methodologies. Beyond identifying Pareto-optimal compression ratios at PTPs, this formalism enables performance prediction under arbitrary memory constraints. By converting the compression problem into a trajectory planning task within the phase space, our Criticality-Aware Framework provides a methodological guarantee for the “Deployment Golden Rule” proposed in “Validation and Perspectives,” enabling near-lossless compression down to 10% of the original model size.

Compute-optimal LLM deployment

While recent training literature advocates Compute-Optimal LLMs65, we argue that deployment efficiency demands analogous optimization. Current state-of-the-art quantization achieves 80% compression with minimal accuracy loss but remains suboptimal. Future work should pursue hybrid compression, synergistically combining pruning’s structural elimination, quantization’s precision reduction, and decomposition’s rank truncation, to transcend existing PTP limits. Additionally, inference-phase optimizations like KV cache compression warrant equal consideration alongside weight-level compression.

Sustainable AI development

As the industry confronts the impending “Data Wall” and the diminishing marginal returns of purely scaling compute, the trajectory of AI development is shifting from the brute-force “Age of Scaling” to a nuance-driven “Age of Research.” In this new paradigm, the challenge is no longer how much compute can be deployed, but how intelligently it can be utilized. Echoing the “Illusion of Scale,” our MPT framework underscores that parameter redundancy is a strategic resource to be mined, not a burden to be carried. Future advancements must prioritize compression efficiency and architectural elegance, maximizing the computational value per parameter to achieve sustainable, high-density intelligence.