
Towards Reliable AI Fairness: Challenges in Steering Features within Bias-Implicated Neurons

Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

DOI:10.63317/2iexsnkqn3j6

Abstract

Large language models (LLMs) perpetuate societal biases, such as gender stereotypes, reinforcing harmful norms and posing significant fairness risks in real-world applications. We investigate a fine-grained mitigation technique that moves beyond surface-level fixes. Our approach uses attribution graphs to identify and directly steer bias-implicated features within a Sparse Autoencoder's (SAE) latent space. This method, known as feature steering, offers a theoretically precise, surgical intervention aimed at correcting bias at its neural source without costly retraining. We critically examine its practical reliability across various contexts. We find that steering effectiveness is highly sensitive to parameter tuning, often requiring unpredictable, context-specific adjustments. The intervention's success exists in narrow "sweet spots", outside of which performance can degrade catastrophically. This demonstrates that while direct intervention on learned features is a powerful analytical tool, significant challenges of brittleness and instability hinder its application as a consistent, broad-scale debiasing solution, necessitating research into more robust control mechanisms.
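To make the abstract's notion of "feature steering" concrete, the sketch below illustrates the general SAE-steering recipe the paper examines: encode an activation into a sparse latent space, pick a (here, arbitrarily chosen) bias-implicated feature, and add a scaled copy of its decoder direction back into the activation. This is a minimal toy illustration with random weights, not the authors' implementation; all names (`W_enc`, `W_dec`, `steer`, the choice of feature, the steering coefficient `alpha`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions for illustration only.
d_model, d_sae = 16, 64

# Hypothetical weights of a trained sparse autoencoder (SAE):
# encoder maps model activations to a sparse latent code,
# decoder maps latent features back to activation space.
W_enc = rng.normal(0.0, 0.1, (d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(0.0, 0.1, (d_sae, d_model))

def sae_encode(h):
    # ReLU gives a sparse, non-negative code over SAE features.
    return np.maximum(h @ W_enc + b_enc, 0.0)

def steer(h, feature_idx, alpha):
    # Feature steering: shift the activation along the decoder
    # direction of one latent feature. Negative alpha suppresses
    # the feature; positive alpha amplifies it.
    return h + alpha * W_dec[feature_idx]

h = rng.normal(size=d_model)        # a residual-stream activation
f = sae_encode(h)                   # sparse latent features
target = int(np.argmax(f))          # stand-in for a bias-implicated feature
h_steered = steer(h, target, alpha=-2.0)  # suppress that feature
```

The paper's central finding is that the workable range of `alpha` is narrow and context-dependent: too small an intervention leaves the bias intact, while too large a shift pushes the activation off-distribution and degrades generation quality.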

Details

Paper ID
lrec2026-main-306
Pages
pp. 3851-3860
BibKey
garridomunoz-etal-2026-reliable
Editor
N/A
Publisher
European Language Resources Association (ELRA)
ISSN
2522-2686
ISBN
978-2-493814-49-4
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Location
Palma, Mallorca, Spain
Date
11–16 May 2026

Authors

  • Ismael Garrido-Munoz

  • Arturo Montejo-Raez

  • Fernando Martínez-Santiago
