ROSE: Reordered SparseGPT for More Accurate One-Shot Large Language Models Pruning

CPAL 2026
Westlake University
*Corresponding author: wanghuan [at] westlake [dot] edu [dot] cn
ENCODE LAB

Abstract

Pruning is widely recognized as an effective method for reducing the parameters of large language models (LLMs), enabling more efficient deployment and inference. One classic and prominent line of one-shot LLM pruning leverages second-order information (i.e., the Hessian), represented by the pioneering work SparseGPT. However, SparseGPT's predefined left-to-right pruning order leads to suboptimal performance when the weights exhibit columnar patterns. This paper studies the effect of pruning order under the SparseGPT framework. The analysis leads us to propose ROSE, a reordered SparseGPT method that prioritizes weight columns with larger potential pruning error so that they are processed earlier. ROSE first performs pre-pruning to identify weights that are highly likely to be pruned, and estimates both column-wise and block-wise pruning losses. The relative range of the block losses serves as a metric to identify columnar layers, for which adaptive reordering is performed: columns within each block are reordered in descending order of column loss, and blocks are reordered in descending order of block loss. Extensive empirical results on prevalent LLMs (LLaMA2-7B/13B/70B, LLaMA3-8B, Mistral-7B) demonstrate that ROSE surpasses the original SparseGPT and other competing pruning methods.

Motivation


(a) Reconstruction error of the "self\_attn.o\_proj" layer in the first Transformer block of LLaMA2-7B during SparseGPT pruning, as a function of the number of pruned blocks; the sharpest increase appears at a late stage. (b) Weight visualization reveals a columnar pattern along the input channels, with one block containing the most concentrated high-magnitude weights. (c) Reconstruction error after reordering: pruning the high-error block earlier yields lower total error.

Overview of our ROSE method


(a) Illustration of the difference between SparseGPT and ROSE. In SparseGPT, as pruning progresses, fewer unpruned weights remain to absorb the compensation; if high-error weights are pruned late, compensation is limited. ROSE reorders columns so that high-error ones are pruned early, preserving more parameters for compensation. (b) ROSE workflow for a columnar layer: given the dense weight matrix W, compute importance scores S, split the columns into blocks, select the smallest p% of scores per block to form the loss matrix, compute column-wise and block-wise losses, then reorder columns (within each block) and blocks in descending order of loss.
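The workflow above can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the importance score S is assumed to be SparseGPT-style, and the block size, pre-pruning ratio p, and relative-range threshold are placeholder values chosen for the example.

```python
import numpy as np

def rose_reorder(S, block_size=128, prune_ratio=0.5, range_thresh=1.0):
    """Sketch of ROSE's adaptive column reordering (details assumed).

    S : (out_dim, in_dim) importance scores for each weight
        (e.g. a SparseGPT-style w^2 / diag(H^-1) score; the exact
        score is an assumption here). in_dim is assumed divisible
        by block_size for simplicity.
    Returns a column permutation to apply before SparseGPT pruning,
    or None when the layer is not judged "columnar".
    """
    n_in = S.shape[1]
    n_blocks = n_in // block_size
    col_loss = np.zeros(n_in)

    # Pre-pruning: within each block, mark the smallest prune_ratio
    # fraction of scores as likely-pruned; those scores form the loss
    # matrix, and each column's loss is the sum of its pruned scores.
    for b in range(n_blocks):
        sl = slice(b * block_size, (b + 1) * block_size)
        blk = S[:, sl]
        k = int(prune_ratio * blk.size)
        thresh = np.partition(blk.ravel(), k)[k]
        col_loss[sl] = np.where(blk <= thresh, blk, 0.0).sum(axis=0)

    # Block loss = sum of column losses inside the block.
    block_loss = col_loss.reshape(n_blocks, block_size).sum(axis=1)

    # The relative range of block losses decides whether the layer is
    # columnar (the threshold value here is an assumption).
    rel_range = (block_loss.max() - block_loss.min()) / (block_loss.mean() + 1e-12)
    if rel_range < range_thresh:
        return None  # keep SparseGPT's default left-to-right order

    # Reorder: blocks by descending block loss, then columns within
    # each block by descending column loss.
    perm = []
    for b in np.argsort(-block_loss):
        cols = np.arange(b * block_size, (b + 1) * block_size)
        perm.extend(cols[np.argsort(-col_loss[cols])])
    return np.array(perm)
```

The returned permutation would be applied to the weight columns (and the corresponding rows/columns of the Hessian) before running SparseGPT, then inverted afterwards to restore the original layout.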

Main Results

BibTeX

@article{su2025rose,
  title={ROSE: Reordered SparseGPT for More Accurate One-Shot Large Language Models Pruning},
  author={Su, Mingluo and Wang, Huan},
  journal={arXiv preprint arXiv:2510.06751},
  year={2025}
}