ROSE: Reordered SparseGPT for More Accurate One-Shot Large Language Models Pruning

CPAL 2026
Westlake University
*Corresponding author: wanghuan [at] westlake [dot] edu [dot] cn
ENCODE LAB

Abstract

Pruning is widely recognized as an effective method for reducing the parameters of large language models (LLMs), enabling more efficient deployment and inference. One classic and prominent line of one-shot LLM pruning leverages second-order information (i.e., the Hessian), represented by the pioneering work SparseGPT. However, SparseGPT's predefined left-to-right pruning order leads to suboptimal performance when the weights exhibit columnar patterns. This paper studies the effect of pruning order under the SparseGPT framework. The analysis leads us to propose ROSE, a reordered SparseGPT method that prioritizes weight columns with larger potential pruning error so that they are processed earlier. ROSE first performs pre-pruning to identify weights that are highly likely to be pruned, and estimates both column-wise and block-wise pruning losses. The relative range of the block losses serves as a metric to identify columnar layers, for which adaptive reordering is performed: columns within each block are reordered in descending order of column loss, and blocks are reordered in descending order of block loss. Extensive empirical results on prevalent LLMs (LLaMA2-7B/13B/70B, LLaMA3-8B, Mistral-7B) demonstrate that ROSE surpasses the original SparseGPT and other competing pruning methods.

Motivation


(a) Reconstruction error of the "self\_attn.o\_proj" layer in the first Transformer block of LLaMA2-7B during SparseGPT pruning, as a function of the number of pruned blocks; the sharpest increase appears at a late stage. (b) Weight visualization reveals a columnar pattern along the input channels, with one block containing the most concentrated high-magnitude weights. (c) Reconstruction error after reordering: pruning the high-error block earlier yields lower total error.

Overview of our ROSE method


(a) Illustration of the difference between SparseGPT and ROSE. In SparseGPT, as pruning progresses, fewer unpruned weights remain to absorb the compensation; if high-error weights are pruned late, compensation is limited. ROSE reorders columns so that high-error ones are pruned early, preserving more parameters for compensation. (b) ROSE workflow for a columnar layer: given the dense weight matrix W, compute importance scores S, split the columns into blocks, select the smallest p% of scores per block to form the loss matrix, compute column-wise and block-wise losses, then reorder columns (within each block) and blocks in descending order of loss.
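The workflow above can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the importance score S is assumed to be SparseGPT-style, and the block size, pre-pruning ratio p, and relative-range threshold are placeholder values chosen for the example.

```python
import numpy as np

def rose_reorder(S, block_size=128, prune_ratio=0.5, range_thresh=1.0):
    """Sketch of ROSE's adaptive column reordering (details assumed).

    S : (out_dim, in_dim) importance scores for each weight
        (e.g. a SparseGPT-style w^2 / diag(H^-1) score; the exact
        score is an assumption here). in_dim is assumed divisible
        by block_size for simplicity.
    Returns a column permutation to apply before SparseGPT pruning,
    or None when the layer is not judged "columnar".
    """
    n_in = S.shape[1]
    n_blocks = n_in // block_size
    col_loss = np.zeros(n_in)

    # Pre-pruning: within each block, mark the smallest prune_ratio
    # fraction of scores as likely-pruned; those scores form the loss
    # matrix, and each column's loss is the sum of its pruned scores.
    for b in range(n_blocks):
        sl = slice(b * block_size, (b + 1) * block_size)
        blk = S[:, sl]
        k = int(prune_ratio * blk.size)
        thresh = np.partition(blk.ravel(), k)[k]
        col_loss[sl] = np.where(blk <= thresh, blk, 0.0).sum(axis=0)

    # Block loss = sum of column losses inside the block.
    block_loss = col_loss.reshape(n_blocks, block_size).sum(axis=1)

    # The relative range of block losses decides whether the layer is
    # columnar (the threshold value here is an assumption).
    rel_range = (block_loss.max() - block_loss.min()) / (block_loss.mean() + 1e-12)
    if rel_range < range_thresh:
        return None  # keep SparseGPT's default left-to-right order

    # Reorder: blocks by descending block loss, then columns within
    # each block by descending column loss.
    perm = []
    for b in np.argsort(-block_loss):
        cols = np.arange(b * block_size, (b + 1) * block_size)
        perm.extend(cols[np.argsort(-col_loss[cols])])
    return np.array(perm)
```

The returned permutation would be applied to the weight columns (and the corresponding rows/columns of the Hessian) before running SparseGPT, then inverted afterwards to restore the original layout.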

Main Results

BibTeX

@article{su2025rose,
  title={ROSE: Reordered SparseGPT for More Accurate One-Shot Large Language Models Pruning},
  author={Su, Mingluo and Wang, Huan},
  journal={arXiv preprint arXiv:2510.06751},
  year={2025}
}