Towards Reliable AI Fairness: Challenges in Steering Features within Bias-Implicated Neurons
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Abstract
Large language models (LLMs) perpetuate societal biases, such as gender stereotypes, reinforcing harmful norms and posing significant fairness risks in real-world applications. We investigate a fine-grained mitigation technique that moves beyond surface-level fixes. Our approach uses attribution graphs to identify and directly steer bias-implicated features within the latent space of a Sparse Autoencoder (SAE). This method, known as feature steering, offers a theoretically precise, surgical intervention aimed at correcting bias at its neural source without costly retraining. We critically examine its practical reliability across various contexts. We find that steering effectiveness is highly sensitive to parameter tuning, often requiring unpredictable, context-specific adjustments. The intervention succeeds only within narrow “sweet spots,” outside of which performance can degrade catastrophically. This demonstrates that while direct intervention on learned features is a powerful analytical tool, its brittleness and instability hinder its application as a consistent, broad-scale debiasing solution, motivating research into more robust control mechanisms.
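To make the intervention concrete, the following is a minimal sketch of SAE feature steering under stated assumptions; it is not the paper's implementation. The interface names `sae.encode`, `sae.W_dec`, `feature_idx`, and the coefficient `alpha` are illustrative. The idea is to shift a residual-stream activation along the decoder direction of one bias-implicated feature, clamping that feature's contribution to a chosen value.

```python
import torch

def steer_feature(h: torch.Tensor, sae, feature_idx: int, alpha: float) -> torch.Tensor:
    """Clamp one SAE feature's contribution in a residual-stream activation.

    Hypothetical interface (not the authors' API):
      h           -- residual-stream activation, shape (d_model,)
      sae.encode  -- maps h to sparse feature activations f, shape (n_features,)
      sae.W_dec   -- decoder weights, shape (n_features, d_model)
      feature_idx -- index of the bias-implicated feature, e.g. one
                     identified via an attribution graph
      alpha       -- target activation value; the steering coefficient
    """
    f = sae.encode(h)              # sparse feature activations
    d = sae.W_dec[feature_idx]     # decoder direction of the target feature
    # Shift h so the feature's contribution becomes alpha instead of f[feature_idx].
    return h + (alpha - f[feature_idx]) * d
```

The fragile “sweet spots” described above correspond to narrow ranges of `alpha`: values too close to the feature's natural activation leave the bias intact, while values far outside it can degrade unrelated model behavior.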