Pruning LLMs for Retrieval: Why Attention Matters and MLPs Don't

Machine learning at scale
Apr 12, 2026

Key Takeaways

  • EffiR prunes MLP layers, keeps attention intact.
  • 50% parameter reduction, 2x speedup on Mistral-7B.
  • Retrieval performance on BEIR benchmarks remains near original.
  • Pruned model outperforms native 3B models on dense retrieval.
  • Stacks with 4-bit quantization, preserving accuracy.

Pulse Analysis

Large language models used as dense retrievers have outpaced traditional BERT‑based encoders, offering richer semantic encoding and zero‑shot capability. However, the computational overhead of running a 7‑billion‑parameter model for every query or document vectorization creates prohibitive latency and cost for real‑time search and retrieval‑augmented generation (RAG) pipelines. Companies have turned to quantization and distillation, yet structural pruning—removing entire layers or neurons—has lagged because most pruning heuristics were derived from generative, next‑token prediction tasks.

EffiR challenges that assumption by showing that, for embedding models, attention layers are the critical component while MLP blocks become largely redundant. The framework first conducts a coarse‑grained depth reduction, dropping up to sixteen MLP layers in Mistral‑7B based on cosine similarity of sub‑layer activations. A subsequent fine‑grained width reduction introduces learnable gating masks that prune inactive neurons within the remaining MLPs. The result is a model that retains the global semantic aggregation power of attention, yet sheds the heavy MLP compute, achieving roughly a 50% drop in parameters and a two‑fold inference speed increase with negligible BEIR performance degradation.
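EffiR's exact scoring rule and training loop are not given in this summary, but the two‑stage idea can be sketched in plain Python: score each MLP sub‑layer by the cosine similarity between its input and output activations (a residual block that barely changes its input is redundant and can be dropped), then hard‑threshold learned gate values to zero inactive neurons in the surviving MLPs. Function names, the toy activations, and the 0.5 gate threshold below are illustrative assumptions, not EffiR's published implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two activation vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def select_mlp_layers_to_drop(layer_io, n_drop):
    """Coarse-grained depth reduction (sketch).

    layer_io maps layer index -> (input_activation, output_activation)
    for an MLP sub-layer. A sub-layer whose output is nearly identical
    to its input is near-redundant: removing it barely perturbs the
    hidden state, so the most-similar layers are pruned first.
    """
    scores = {
        idx: cosine_similarity(x_in, x_out)
        for idx, (x_in, x_out) in layer_io.items()
    }
    # Highest input/output similarity = most redundant = dropped first.
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_drop]

def prune_neurons(hidden_acts, gates, threshold=0.5):
    """Fine-grained width reduction (sketch).

    gates are the trained values of learnable masks (e.g. sigmoid
    outputs in [0, 1]); neurons whose gate falls below the threshold
    are zeroed out of the remaining MLP.
    """
    return [h * (1.0 if g >= threshold else 0.0)
            for h, g in zip(hidden_acts, gates)]

# Toy example: layer 2 is a near-identity, layer 0 transforms heavily.
layer_io = {
    0: ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),  # orthogonal: keep
    1: ([1.0, 1.0, 0.0], [1.0, 0.9, 0.1]),  # very similar
    2: ([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),  # identical: drop first
}
print(select_mlp_layers_to_drop(layer_io, 2))   # [2, 1]
print(prune_neurons([1.0, 2.0, 3.0], [0.9, 0.1, 0.6]))  # [1.0, 0.0, 3.0]
```

In a real pipeline the activations would be averaged over a calibration corpus rather than taken from a single forward pass, and the gates would be trained jointly with a sparsity penalty before thresholding.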

From a business perspective, EffiR pushes the efficiency‑performance frontier, allowing firms to leverage the reasoning capacity of a 7B architecture without incurring its full cost. The pruned model not only beats native small models on retrieval benchmarks but also pairs seamlessly with 4‑bit NF4 quantization, delivering further memory and latency reductions. This opens the door for cost‑effective, high‑throughput vector search services and more responsive RAG applications, prompting a reevaluation of off‑the‑shelf pruning tools that indiscriminately target attention and MLPs alike.
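The write-up does not name a published checkpoint, so the model path below is a placeholder; but stacking a pruned encoder with 4‑bit NF4 quantization would, in a standard Hugging Face pipeline, look roughly like this configuration sketch:

```python
import torch
from transformers import AutoModel, BitsAndBytesConfig

# NF4 quantization config; compute dtype is a common choice, not
# something the article specifies.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# "pruned-effir-mistral-7b" is a hypothetical path for the pruned model.
model = AutoModel.from_pretrained(
    "pruned-effir-mistral-7b",
    quantization_config=bnb_config,
)
```

Because the pruning removes whole MLP sub-layers before quantization is applied, the two techniques compose: NF4 shrinks the weights that remain, rather than competing with the pruning for the same accuracy budget.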
