Sparse Autoencoders

Exploring the mechanistic interpretability of large models using sparse autoencoders

Master's Student: Haoran Ye | Advisor: Jimmy C.-H. Peng | Project Duration: 2024 - 2025

Cite this work as:

Li J, Ye H, Chen Y, et al. Training Superior Sparse Autoencoders for Instruct Models[J]. arXiv preprint arXiv:2506.07691, 2025.


💡 TL;DR:

In this project, we identify two problems in previous SAE training approaches for instruct models:

📚 Suboptimal dataset selection affecting SAE performance.
✂️ Semantic discontinuity caused by block training truncating samples mid-content.

Therefore, we propose Finetuning-aligned Sequential Training (FAST)💪, a novel training method specifically tailored for instruct models. The results demonstrate:

Token Reconstruction Performance 📉: FAST shows better token reconstruction performance. On Qwen2.5-7B-Instruct, FAST achieves a mean squared error of 0.6468, significantly outperforming the baseline methods, whose errors are 5.1985 and 1.5096.
Feature Interpretability 🎯: FAST yields a higher proportion of high-quality features. For Llama3.2-3B-Instruct, 21.1% scored in the top range, compared to 7.0% and 10.2% for BT(P) and BT(F).
Novel Discovery 🔍: Intervening on the activations of special tokens via the SAEs can improve output quality, which suggests new opportunities for fine-grained control of model behavior and for future research.

Find the details in our post below👇

Figure 1: Overview of the sparse autoencoder, illustrating its process for interpreting the internal representations of large language models.


Motivation: Why Traditional SAE Training Falls Short

Imagine reading a novel where every few pages, the story abruptly jumps to a completely different book—confusing📚✂️, right? This is essentially what happens with traditional Sparse Autoencoder (SAE) training methods for large language models!

Block Training (BT) has become the default approach for SAE training, where datasets (usually pretraining datasets) are concatenated into fixed-length blocks (Joseph Bloom and Chanin, 2024; Bricken et al., 2023). While this works reasonably well for base models—which are accustomed to processing random text chunks during pretraining—it creates significant problems for instruct models that have been fine-tuned to understand complete, coherent instructions.

Consider a typical 8,192-token training block: BT might stitch together 2,048 tokens from one sample with 6,144 tokens from another, creating jarring semantic discontinuities. For instruction-tuned models designed to maintain contextual understanding, this abrupt semantic "cliff edge" severely compromises their ability to align with downstream tasks and maintain coherent representations (Kissane et al., 2024b).

To solve this fundamental mismatch, we introduce Finetuning-aligned Sequential Training (FAST), a novel method specifically designed for training SAEs on instruct models (see Figure 2). Unlike BT, our approach processes each data instance independently, preserving semantic integrity and maintaining alignment with the model's fine-tuning objectives. This ensures the model operates in a consistent semantic space during SAE training, which improves both training quality and the interpretability of the resulting SAE features.

Figure 2: Illustration of the LLM training pipeline and SAE training methods. (a) The pipeline transitions from pretraining to fine-tuning. (b) Block Training (BT) concatenates datasets and resplits them into fixed-length blocks. (c) Finetuning-aligned Sequential Training (FAST) processes data instances independently, preserving semantic integrity and improving alignment with fine-tuning objectives, leading to better performance in feature interpretability.
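The difference between the two data pipelines can be sketched in a few lines of Python. This is a minimal illustration rather than the project's actual implementation: the function names, the toy token IDs, the choice to drop the incomplete final block in BT, and the truncation of overlong samples to max_len in the FAST-style path are all assumptions made for the example.

```python
from typing import List


def block_training_chunks(samples: List[List[int]], block_size: int) -> List[List[int]]:
    """BT-style: concatenate every sample into one stream, then resplit into fixed-length blocks."""
    stream = [tok for sample in samples for tok in sample]
    # Assumption: the trailing remainder that does not fill a whole block is dropped.
    n_full = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_full)]


def sequential_instances(samples: List[List[int]], max_len: int) -> List[List[int]]:
    """FAST-style: each sample stays its own sequence; no tokens from other samples are mixed in."""
    # Assumption: samples longer than max_len are simply truncated.
    return [sample[:max_len] for sample in samples]


# Toy token IDs: sample A uses 1xx, sample B uses 2xx, sample C uses 3xx.
samples = [[101, 102, 103], [201, 202, 203, 204, 205], [301, 302]]

bt_blocks = block_training_chunks(samples, block_size=4)
fast_sequences = sequential_instances(samples, max_len=8)

print(bt_blocks)       # the first block mixes tokens from samples A and B
print(fast_sequences)  # every sequence contains tokens from exactly one sample
```

In the toy run, the first BT block ends with a token from sample B right after the last token of sample A, which is the kind of mid-content boundary described above, while the FAST-style sequences never cross a sample boundary.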

Experiments

Dataset: Combined multiple high-quality instruction datasets (WildChat-1M-Full, Infinity-Instruct, tulu-3-sft-mixture, orca-agentinstruct-1M-v1-cleaned, and lmsys-chat-1m), resulting in ~4.7M samples after deduplication. For BT(P), we use the Pile dataset to train the corresponding SAEs.
Models: We evaluate on 7 models from the Llama (3.1, 3.2) and Qwen (2.5) series.

Results:

Figure 3: MSE performance of the JumpReLU SAE (all metrics are presented in log scale, where lower values indicate better SAE reconstruction performance). Within the JumpReLU architecture, FAST exhibits the best reconstruction capability compared to BT(P) and BT(F).

From Figure 3, we find that:

Lower Error⬇️: FAST achieves the lowest MSE among all methods tested.
Better Token Reconstruction: FAST outperforms other methods in reconstructing both general and special tokens, especially on Llama and Qwen models.
Stronger Impact on Standard SAE: The improvement brought by FAST is more significant in Standard SAE, overcoming its previous limitations. For JumpReLU SAE, while the gains are smaller due to its already strong baseline, FAST still delivers meaningful performance improvements.
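To make the reconstruction metric concrete, the sketch below runs one activation vector through a toy SAE with either a Standard (ReLU) or a JumpReLU activation and computes the mean squared error of the reconstruction. It is a simplified illustration under stated assumptions: the weights are hand-picked toy values, the names (W_enc, W_dec, theta, sae_reconstruct) are hypothetical, and averaging the squared error over the hidden dimension is one common convention that may differ from the normalization used in the paper.

```python
from typing import List, Optional, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product, one output per row of W."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def relu(z: Vector) -> Vector:
    return [max(0.0, v) for v in z]


def jump_relu(z: Vector, theta: Vector) -> Vector:
    """Keep a pre-activation only if it exceeds its per-feature threshold."""
    return [v if v > t else 0.0 for v, t in zip(z, theta)]


def sae_reconstruct(x: Vector, W_enc: Matrix, b_enc: Vector, W_dec: Matrix,
                    b_dec: Vector, theta: Optional[Vector] = None) -> Tuple[Vector, Vector]:
    """Encode x into sparse features, then decode back to the activation space."""
    pre = [p + b for p, b in zip(matvec(W_enc, x), b_enc)]
    feats = relu(pre) if theta is None else jump_relu(pre, theta)
    recon = [r + b for r, b in zip(matvec(W_dec, feats), b_dec)]
    return feats, recon


def mse(x: Vector, x_hat: Vector) -> float:
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


# Toy setup: 2-dimensional activation, 3 SAE features (hand-picked weights).
x = [1.0, 2.0]
W_enc = [[1.0, 0.0], [0.0, 1.0], [0.1, 0.1]]
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
b_dec = [0.0, 0.0]

feats_relu, recon_relu = sae_reconstruct(x, W_enc, b_enc, W_dec, b_dec)
feats_jump, recon_jump = sae_reconstruct(x, W_enc, b_enc, W_dec, b_dec, theta=[0.5, 0.5, 0.5])

print(mse(x, recon_relu), mse(x, recon_jump))
```

The toy numbers only show how the metric is computed; they say nothing about the relative quality of the two architectures, which depends on trained weights.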

But can we intuitively feel this advantage? -> Feature Interpretability:

While metrics like MSE provide an objective, quantitative comparison of reconstruction capabilities across different SAE architectures, they can feel somewhat detached from practical experience. To better evaluate the real-world quality of an SAE, it is important to also consider experimental methods such as feature interpretability, which offer more intuitive insights into model performance.

An additional 10,000 instances are sampled and their activation values are computed. For each feature, the five sentences with the highest activation values are then selected to build an activation dataset for evaluation. GPT-4o is prompted to score each group of five contexts and to generate a descriptive summary.
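The selection step described above can be sketched as follows. This is only an illustration: the function name top_k_contexts, the (sentence, activation) record layout, and the toy values are assumptions, and the call to GPT-4o for scoring is deliberately left out.

```python
import heapq
from typing import List, Tuple


def top_k_contexts(records: List[Tuple[str, float]], k: int = 5) -> List[Tuple[str, float]]:
    """Return the k (sentence, activation) pairs with the highest activation for one feature."""
    return heapq.nlargest(k, records, key=lambda pair: pair[1])


# Toy activations of a single SAE feature on seven sampled sentences.
records = [
    ("s0", 0.1), ("s1", 2.5), ("s2", 0.0), ("s3", 1.7),
    ("s4", 3.2), ("s5", 0.9), ("s6", 2.1),
]

top_five = top_k_contexts(records, k=5)
print([sentence for sentence, _ in top_five])
```

The resulting five contexts would then form the group that is passed to the scoring model for that feature.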

The feature evaluation metric we designed for the LLM follows Llama Scope (2024):

Figure 4: Experimental results for feature interpretability. FAST achieves notable improvements over the other two training methods across all tested models. For the 8x Llama3.2-3B-Instruct SAE, FAST attains 21.1% of features rated in the highest quality range (scores 4-5), in contrast to 7.0% for BT(P) and 10.2% for BT(F).

The results demonstrate that FAST consistently outperforms BT(P) and BT(F) across all evaluated SAEs. For the 8x scaled Llama3.2-3B-Instruct, FAST achieves 21.1% of features in the highest quality range (scores 4-5), compared to 7.0% for BT(P) and 10.2% for BT(F).

Generally, compared to both baseline methods, we observe that FAST reduces the proportion of low-quality features while increasing the proportion of high-quality features in the 8x and 16x SAEs. This highlights the advantage of FAST in producing more interpretable features during SAE training.

Furthermore, Cumulative Distribution Function (CDF) curve analysis reveals that FAST consistently has the lowest share of low-scoring features. For instance, with the Qwen2.5-3B-Instruct model, the CDF at score 3 is 76.5% for FAST, compared to 89.0% for BT(F) and 92.2% for BT(P), indicating fewer low-scoring features for FAST. These findings suggest that both appropriate training dataset selection for SAEs and the sequential training methodology contribute to enhanced model interpretability. FAST appears to successfully integrate these aspects, leading to more interpretable SAEs.
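As a small worked example of the two summary statistics used here, the sketch below computes the CDF at a given score and the share of features in the top range from a list of per-feature scores. The function names and the toy scores are assumptions, and the CDF is taken as the fraction of features scoring at or below the threshold, which is the usual definition but is not spelled out in the text.

```python
from typing import List


def score_cdf(scores: List[int], threshold: int) -> float:
    """Fraction of features whose score is at or below the threshold."""
    return sum(s <= threshold for s in scores) / len(scores)


def high_quality_share(scores: List[int], lowest_top_score: int = 4) -> float:
    """Fraction of features in the top range (scores 4-5 by default)."""
    return sum(s >= lowest_top_score for s in scores) / len(scores)


# Toy interpretability scores (1-5) for ten features; not real results.
scores = [1, 2, 3, 3, 4, 5, 2, 4, 3, 5]

cdf_at_3 = score_cdf(scores, threshold=3)
top_share = high_quality_share(scores)
print(cdf_at_3, top_share)
```

Under this definition, a lower CDF value at score 3 means that more features received a score of 4 or 5, so the two statistics sum to one.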

Case Study: Feature Steering

By performing steering operations on certain special features identified by the SAE, we are able to modify the model's original activation patterns and thereby influence its final output. This provides an intuitive demonstration of how the SAE decomposes the model into disentangled, semantically meaningful features. Since our models are trained in an instruction-tuned setting, we are particularly interested in understanding the roles of features most strongly associated with special tokens in Instruct models. Specifically, we investigate the features with the highest activation for <|im_start|> in Qwen2.5-7B-Instruct and <|start_header_id|> in Llama3.1-8B-Instruct, and apply targeted steering to these features. This allows us to explore how varying the steering coefficients affects the model's output on a range of question-answering tasks.
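One common way to steer with an SAE feature is to add a scaled copy of that feature's decoder direction to the hidden state at the chosen token positions. The sketch below illustrates this formulation on toy data; it is an assumption about the general technique, not a description of the exact intervention used in this project, and the names (find_positions, steer, direction, alpha) as well as all numeric values are hypothetical.

```python
from typing import List, Sequence, Set


def find_positions(tokens: Sequence[str], targets: Set[str]) -> List[int]:
    """Indices of the tokens to intervene on (e.g. special tokens)."""
    return [i for i, tok in enumerate(tokens) if tok in targets]


def steer(hidden: List[List[float]], positions: List[int],
          direction: List[float], alpha: float) -> List[List[float]]:
    """Return a copy of hidden where alpha * direction is added at the given positions."""
    steered = [list(h) for h in hidden]  # copy, so the original activations stay intact
    for pos in positions:
        steered[pos] = [h + alpha * d for h, d in zip(steered[pos], direction)]
    return steered


# Toy sequence with 2-dimensional hidden states.
tokens = ["<|im_start|>", "user", "Hello"]
hidden = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
direction = [1.0, -1.0]  # stand-in for the decoder vector of one SAE feature

positions = find_positions(tokens, {"<|im_start|>", "user"})
steered = steer(hidden, positions, direction, alpha=25.0)
print(positions, steered)
```

Here alpha plays the role of the steering coefficient: a value of zero leaves the activations unchanged, and larger values push the selected positions further along the feature direction.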

Some interesting results emerge when steering SAE features on concrete questions:

Figure 5: The steering output generated by Qwen2.5-7B-Instruct with Feature ID: 13794, focusing on user and <|im_start|> tokens for Question 2 (entity description).

Figure 6: The steering output generated by Llama3.1-8B-Instruct with Feature ID: 22642, focusing on <|start_header_id|> tokens for Question 2 (entity description).

In Question 2, Qwen performs best when feature 13794 is activated within a moderate range (specifically, with α between 25 and 75). Within this range, the model produces coherent, detailed, and informative responses. However, when α is set too high (such as α≥100), Qwen degrades severely: it hallucinates, produces repetitive content, and loses coherence in its outputs.

Llama, in contrast, shows limited responsiveness. It only exhibits meaningful improvements within a narrow range of α=15 to 25. Within this window, the model demonstrates slightly enhanced politeness and helpfulness, though the improvements remain modest. Outside this range, it rapidly deteriorates into repetitive and incoherent output patterns.

Summary

This project introduces FAST (Finetuning-aligned Sequential Training), a novel method for training Sparse Autoencoders (SAEs) on instruction-tuned models. Traditional block training creates semantic discontinuities by concatenating unrelated text chunks, which poorly aligns with instruct models that expect coherent conversations.

FAST processes each dialogue independently, preserving semantic integrity. Results show significant improvements: FAST achieves 21.1% high-quality interpretable features vs 7.0-10.2% for baselines, and better token reconstruction (MSE 0.6468 vs 1.5096-5.1985 for baselines on Qwen2.5-7B). The method also enables effective model steering through SAE feature manipulation, demonstrating enhanced interpretability and control capabilities for instruction-following models.

Open Source Path

The full post for this project is available here. The paper is available on arXiv here. The code used in this study is available for download here.
