Analyze Feature Flow to Enhance Interpretation and Steering in Language Models
We introduce a new approach to systematically map features discovered by sparse autoencoders across consecutive layers of large language models, extending earlier work that examined inter-layer feature links. Using a data-free cosine similarity technique, we trace how specific features persist, transform, or first appear at each stage. This method yields granular flow graphs of feature evolution, enabling fine-grained interpretability and mechanistic insights into model computations. Crucially, we demonstrate how these cross-layer feature maps facilitate direct steering of model behavior by amplifying or suppressing chosen features, achieving targeted thematic control in text generation. Together, our findings highlight the utility of a causal, cross-layer interpretability framework that not only clarifies how features develop through forward passes but also provides new means for transparent manipulation of large language models.
Discussion
Host: Hey everyone, welcome back to the podcast! Today, we're diving into some seriously fascinating research happening at the intersection of language models and interpretability. It's a bit of a technical deep dive, but I think the core concepts are incredibly cool and relevant to where AI is headed. We're talking about understanding what's happening inside these massive neural networks, which is still kind of a black box for most of us. I'm excited to unpack this!
Guest: Yeah, it's definitely a hot topic! Everyone's building these giant language models, but figuring out how they actually work – what's going on under the hood – is still a huge challenge. It's not just about getting them to generate text that sounds good; it's about understanding why they generate that text. What are they actually learning? And how can we be sure they're not learning biases or other undesirable behaviors along the way?
Host: Exactly! It's like, we've built these incredibly powerful machines, but we don't fully speak their language. That's where this research comes in. It's all about 'Analyze Feature Flow to Enhance Interpretation and Steering in Language Models.' It's by Daniil Laptev, Nikita Balagansky, Yaroslav Aksenov, and Daniil Gavrilov. I know, it's a mouthful. But it's doing some super interesting work on mapping features across different layers of a language model. We'll get into what 'features' means shortly. The traditional approach analyzes features within a single layer, but these guys are looking at how features evolve and interact across the entire network. And that's where the magic happens. It's about feature flow.
Guest: Okay, so instead of just looking at a snapshot of what one layer is doing, they're trying to trace the journey of information as it flows through the model. That makes a lot of sense. These models are deep, multilayered structures, so you'd expect that the information gets transformed and refined as it moves from one layer to the next. You wouldn't expect the concepts and representations to just stay static.
Host: Precisely! Now, before we get lost in the weeds, let's clarify what we mean by 'features' here. It's a crucial concept. Think of features as building blocks of understanding for the language model. The idea is that these models encode concepts as linear directions within their hidden representations. Features are essentially directions in vector space, each corresponding to a specific concept. When you activate a certain feature, it's like flipping the light switch for that concept, and it has downstream impact. This paper uses something called 'sparse autoencoders,' or SAEs, to disentangle these directions into what they call 'monosemantic features,' which basically means each feature ideally represents one single, clear concept. The SAE decomposes the model's hidden state into a sparse weighted sum of interpretable features.
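To make that decomposition concrete, here's a minimal sketch of a sparse autoencoder in PyTorch. The exact architecture, sparsity mechanism, and training setup of the SAEs used in the paper may differ; this only illustrates the encode-then-reconstruct shape being described.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy SAE: decompose a hidden state into a sparse weighted sum of feature directions."""

    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)              # hidden state -> feature activations
        self.decoder = nn.Linear(n_features, d_model, bias=False)  # decoder weight columns = feature directions

    def forward(self, h: torch.Tensor):
        a = torch.relu(self.encoder(h))   # ReLU keeps only a sparse set of non-negative activations
        h_hat = self.decoder(a)           # reconstruction = weighted sum of the active features' directions
        return a, h_hat
```

Each column of the decoder weight matrix is the direction in hidden-state space that one feature "means," and it's exactly these directions that get compared across layers later in the conversation.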
Guest: Ah, okay. So the sparse autoencoder is like a tool to break down the complex internal representations of the language model into more manageable and understandable pieces. It’s like taking a complicated machine apart to see what each component is doing. The 'sparse' part is important because it means that only a few features are active at any given time, which makes the model more efficient and easier to interpret. And by using the sparse autoencoders, each feature represents something like 'noun' or 'verb'. It seems like one of the goals is to force the model to be more explicit in how it represents information.
Host: Yes, that's a great analogy. The SAE is trying to find features that are both interpretable and that sparsely represent information. It essentially makes the model's internal workings more transparent and understandable. And this paper's key contribution is to track these SAE features across multiple layers of the model. Their data-free approach aligns SAE features across the different modules at each layer, so you can see how features originate, propagate, or vanish as information flows through the network, in the form of flow graphs.
Guest: So, how do they actually track these features across layers? What's their method for linking a feature in one layer to a corresponding feature in another layer?
Host: That's where the 'data-free cosine similarity technique' comes in. They use this to align the SAE features across the different modules – like the MLP, attention mechanism, and residual connections – at each layer. It works by comparing the decoder weights of the SAEs trained at different positions in the model. It helps to track how these directions evolve or appear across layers.
Guest: Decoder weights… okay, that sounds a bit technical. Can we break that down a little? What exactly are the decoder weights, and why is cosine similarity a good way to compare them?
Host: Sure thing. The decoder weights are part of the sparse autoencoder. Remember, the SAE's job is to reconstruct the hidden state of the language model using a sparse combination of features. The encoder takes the hidden state and maps it to a sparse set of feature activations, and the decoder takes those activations and tries to reconstruct the original hidden state. So, the decoder weights are essentially the vectors that map the feature activations back to the original hidden-state space. They're like the 'meaning' of each feature in the original representation space of the language model. A high cosine similarity between two decoder vectors means the two features point in nearly the same direction, so they're similar.
Guest: That makes sense. So, by comparing the decoder weights, you're essentially comparing the 'meaning' of the features in different layers. And cosine similarity is a good metric because it measures the angle between two vectors, so it tells you how similar the directions of the features are, regardless of their magnitude. It’s a direction-based measure, so it focuses on the semantic content. This makes a lot of sense for comparing directions in high-dimensional vector spaces.
Host: Exactly! It’s data-free because it just needs the weights of the trained SAEs, not data. They calculate the cosine similarity between every pair of features across layers. Then, for each feature in a given layer, they find the feature in the next layer that has the highest cosine similarity. This gives them a way to map features from one layer to the next, creating these 'flow graphs' that show how features evolve through the model.
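Here's a short sketch of that matching step, under the assumptions just described (decoder columns compared with cosine similarity, best match taken per feature). The function and variable names are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def match_features(W_dec_A: torch.Tensor, W_dec_B: torch.Tensor):
    """Data-free feature matching between two SAEs.

    W_dec_A: (d_model, n_features_A) decoder weights of the SAE at position A.
    W_dec_B: (d_model, n_features_B) decoder weights of the SAE at position B.
    Returns, for every feature in A, the index of its most similar feature in B
    and the corresponding cosine similarity.
    """
    A = F.normalize(W_dec_A, dim=0)    # make every decoder column unit norm
    B = F.normalize(W_dec_B, dim=0)
    sim = A.T @ B                      # (n_A, n_B) matrix of cosine similarities
    scores, best_j = sim.max(dim=1)    # best match in B for each feature in A
    return best_j, scores
```

Repeating this between consecutive residual-stream SAEs, and between each layer's module-output SAEs and the residual stream, yields the edges of the flow graphs being described.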
Guest: Alright, now I’m starting to get a clearer picture. The flow graph is a visualization of how the model processes information, tracing the lineage of these interpretable features as they travel through the layers. So, what have they found by using this approach? What kinds of patterns have they observed in these flow graphs?
Host: Well, that's where it gets really interesting. They've uncovered distinct patterns of feature birth and refinement that they couldn't see with single-layer analyses. A flow graph reveals an evolutionary pathway, which doubles as an internal, circuit-like computational pathway. The graphs show that the MLP and attention modules introduce new features or change already existing ones. This kind of analysis really starts to give us a peek into the model's inner workings.
Guest: So, the MLP and attention modules are not just processing information; they're actively creating new features and modifying existing ones. It's like they're the 'creative' parts of the network, responsible for generating new representations. And the flow graph lets you see how these creative processes unfold across the layers. It's almost like watching the model 'thinking' in slow motion.
Host: That’s a good analogy. The MLP and attention modules shape information. Let's talk a bit more about these modules. As a quick refresher, the MLP, or Multilayer Perceptron, is a feedforward neural network that applies non-linear transformations to the hidden states. Think of it as adding complexity and expressiveness to the representations. It introduces non-linear combinations of features to capture dependencies.
Guest: Okay, so MLP adds complexity. What about the attention module?
Host: The attention mechanism allows the model to focus on the most relevant parts of the input sequence when processing each token. It's like highlighting the important words in a sentence to understand the context. It establishes relationships between different parts of the input, and it weights those connections differently based on the input. Both attention and the MLP read from and write into the residual stream, which serves as a communication channel: each module reads the current state from the residual stream, processes it, and adds its output back into the stream.
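Schematically, that read-and-write pattern looks something like the pre-norm block below. This is a simplified sketch; the exact normalization and block structure vary by model.

```python
def transformer_block(h, attn, mlp, ln1, ln2):
    # Attention reads the residual stream (after layer norm) and writes its output back into it.
    h = h + attn(ln1(h))
    # The MLP does the same; the residual stream itself is only ever added to, never overwritten.
    h = h + mlp(ln2(h))
    return h
```

That additive structure is why features can persist in the residual stream across many layers while the modules create or reshape them.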
Guest: And this paper said that the features in the residual stream remain relatively unchanged across layers, right? That's kind of surprising.
Host: Yeah, it is! The residual stream is the main conduit. Think of it as a highway. And according to Balagansky et al. (2024), most features in the residual stream stay relatively unchanged across layers. This suggests that the core semantic content is largely preserved as it flows through the network; the changes, the actual computations, happen in the modules. However, the MLP and attention modules are constantly reading from this stream, processing the information, and writing their outputs back into it, thereby shaping the flow of information.
Guest: So, the residual stream provides the backbone, and the MLP and attention modules are the sculptors, constantly refining and transforming the information as it passes through. This creates a dynamic interplay that ultimately determines the model's behavior.
Host: Precisely. That dynamic interplay is what this paper is trying to capture with its feature flow graphs. And they've found that by understanding these flow graphs, you can actually improve the quality of model steering. By building a flow graph, they uncover an evolutionary pathway, an internal circuit-like computational pathway, in which the MLP and attention modules introduce new features or change already existing ones.
Guest: Model steering… that’s a cool concept. Can you elaborate a bit? What do they mean by 'steering' the model, and how do these flow graphs help?
Host: Steering, in this context, refers to directly influencing the model's behavior by manipulating its internal representations. Traditionally, people change the input prompts to change the output. These researchers are changing the internal mechanisms and representations to change the output! The paper shows that flow graphs can improve model steering by targeting multiple SAE features at once, and they also give you a better understanding of the steering outcome. It's like you're taking the wheel inside the model's head and guiding it towards the desired outcome. The idea is that by carefully amplifying or suppressing specific features, you can control the themes or topics that the model generates.
Guest: Ah, I see! So, instead of just prompting the model with different text, you're actually going in and tweaking the internal knobs and dials to make it generate text on a specific topic. That's a much more direct and controlled way of influencing the model's output.
Host: Exactly. And the flow graphs provide a roadmap for how to do this effectively. By identifying the features that are most relevant to a particular theme, and by understanding how those features evolve across the layers, you can target your interventions more precisely. It's not just about activating or deactivating a single feature; it's about coordinating the activation of multiple features across multiple layers to achieve a desired effect.
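As a rough sketch of what that intervention could look like in code: nudge the residual-stream activations along chosen feature directions during the forward pass. The function below is a hypothetical illustration, not the paper's exact steering procedure, and the coefficients would need tuning.

```python
import torch

def steer(h: torch.Tensor, directions: list[torch.Tensor], coeffs: list[float]) -> torch.Tensor:
    """Nudge residual-stream activations along chosen SAE feature directions.

    h:          (batch, seq, d_model) activations at one layer.
    directions: unit-norm decoder columns of the features to steer with.
    coeffs:     positive values amplify a feature's theme, negative values suppress it.
    """
    for d, c in zip(directions, coeffs):
        h = h + c * d   # the (d_model,) direction broadcasts over batch and sequence positions
    return h
```

With a flow graph in hand, the same concept can be targeted through its matched features at several consecutive layers, rather than poked at a single layer in isolation.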
Guest: Okay, so it's a more holistic approach to model steering. Instead of just poking at individual neurons, you're trying to orchestrate the activity of entire circuits within the network. It's like conducting an orchestra instead of just hitting a single note.
Host: That’s a great analogy! And it's not just about steering the model towards a specific topic. It’s also about understanding the consequences of your interventions. The flow graphs can help you predict how the model's behavior will change after you've manipulated certain features. It allows for transparent manipulation of large language models.
Guest: So, it's a more interpretable form of steering. You're not just blindly tweaking parameters; you have a way to understand why your interventions are having the effects that they do. That's crucial for building trust and safety into these systems.
Host: Absolutely. The paper helps to discover the lifespan of SAE features, understand their evolution across layers, and shed light on how they form computational circuits, which enables more precise control over model behavior. I think that covers the introduction and preliminaries to the paper. Next, let's dive into the technical aspects and the method they use to trace features.
Guest: Sounds good! I'm eager to understand the nuts and bolts of how they actually build these flow graphs.
Host: Alright, let's dive into the method section. To recap, the goal is to find features shared by two SAEs trained at different positions in the model. Their key idea is: if we want to find features shared by two SAEs trained at positions A and B, we need to discover a mapping between their feature sets. Several methods exist for this, ranging from matching features between layers and modules after training to architectures that ensure persistent collections of features by design.
Guest: So it's about finding a way to translate between the feature spaces of different layers. It's like having a dictionary that tells you which feature in layer A corresponds to which feature in layer B. And that mapping is the key to building the flow graph.
Host: Exactly! Let's consider how we do it. One approach uses correlations between activations, but that requires a lot of data to compute activation statistics. Another is a data-free approach based only on the SAE weights. The paper focuses on the data-free route, using cosine similarity between decoder weights as the similarity metric, since it directly compares the feature directions themselves without running the model on any data.
Guest: Okay, so they're opting for the data-free approach, which is more efficient. It avoids the need to gather large amounts of data to calculate activation statistics. They mentioned that the feature's vector is the i-th column of W_dec(A). What does that mean?
Host: Okay, so let's say f is the embedding of some feature F_i(A) from the SAE trained at position A; this vector is the i-th column of the decoder weights W_dec(A). Also, let W_dec(B) ∈ R^(d×|F|) be the decoder weights of an SAE trained at position B. We find the matched feature index j by picking the column of W_dec(B) with the highest cosine similarity to f, and then we say that F_i(A) corresponds to F_j(B). They assume that both f and the columns of W_dec(B) have unit norm, so the dot product is exactly the cosine similarity.
Guest: Okay, so the vector f is the embedding of the i-th feature at position A, and we're comparing features trained at different positions. By finding the maximum cosine similarity, you're finding the feature in position B that's most similar to the i-th feature in position A.
Host: That's right! Then they say that F_i(A) corresponds to F_j(B), and they collect these matches into a mapping they call T(A→B). This is where k comes in, the top-k operator: in the paper, k = 1. The general equation can handle many-to-many cases, but with k = 1 you get a many-to-one matching, which already extends the one-to-one approach from prior work, since several features in A can map to the same feature in B.
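To pin that down, the matching rule the two of them just walked through can be written out as follows. This is a reconstruction from the discussion, so the exact notation may differ from the paper's:

\[
j \;=\; \underset{1 \le j' \le |\mathcal{F}|}{\arg\max}\;\; \mathbf{f}^{\top}\, \big[\mathbf{W}_{\mathrm{dec}}(B)\big]_{:,\,j'} ,
\]

where f is the i-th (unit-norm) column of W_dec(A), so each dot product is a cosine similarity. Replacing the arg max with a top-k operator gives the general T(A→B) mapping, and k = 1 recovers the many-to-one matching described above.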
Guest: Got it. So, they're extending previous research that focused on one-to-one mappings between features. Now, they're exploring many-to-one mappings, which means that multiple features in layer A can map to a single feature in layer B. Does this imply that the features in layer A can be thought of as components of one feature in layer B?
Host: In some sense, yes. It can mean that several features in A get combined into, or collapse onto, a single feature in B. One caveat: their technique assumes the SAEs are trained on hidden states whose structure is aligned. If the hidden states are not aligned, comparing decoder directions doesn't make sense, and the method cannot be applied. The distribution of the training data also affects these results.
Guest: I see. So, they're relying on the assumption that the hidden states in different layers have a similar structure, so that comparing the decoder weights makes sense. And you also need to be careful about the data distribution, as that can affect the results. The hidden states are usually aligned enough to run the SAEs, though. Can you describe in more detail how they track the evolution of a feature after matching?
Host: Sure. Once features can be matched between any two SAEs, they track a feature's evolution by looking at a few key computational points around each layer: the residual stream before the layer, the outputs of that layer's attention and MLP modules, and the residual stream after the layer. For a feature found in the residual stream after layer L, they compute its maximum cosine similarity to the SAE features at each of these earlier points, giving scores s(R) for the previous residual stream, s(M) for the MLP output, and s(A) for the attention output.
Guest: So, the four main computational points they are looking at are the input and outputs of each layer and module. Then they calculate the maximum cosine similarity to infer how the feature relates to the previous layer or modules. What are the 4 possibilities the paper mentions?
Host: The four possibilities based on the similarity scores are: A) high s(R) and low s(M) and s(A): the feature likely existed in R_{L-1} and was translated to R_L. B) high s(R) and high s(M) or s(A): the feature was likely processed by the MLP or attention. C) low s(R) but high s(M) or s(A): the feature may be newborn, created by the MLP or attention. D) low s(R) and low s(M) and s(A): the feature cannot be easily explained by maximum cosine similarity alone. The thresholds for 'high' and 'low' are specific to each layer.
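A small illustration of that bucketing logic, with placeholder thresholds (the paper uses layer-specific values, which aren't reproduced here):

```python
def classify_feature(s_R: float, s_M: float, s_A: float,
                     thr_R: float, thr_mod: float) -> str:
    """Bucket a layer-L residual feature by its max cosine similarities to the previous
    residual stream (s_R), the MLP output (s_M), and the attention output (s_A).
    The thresholds are layer-specific in the paper; here they are just parameters."""
    module_high = max(s_M, s_A) >= thr_mod
    if s_R >= thr_R and not module_high:
        return "translated"    # existed in R_{L-1} and was carried forward
    if s_R >= thr_R and module_high:
        return "processed"     # existed before, but touched by the MLP or attention
    if s_R < thr_R and module_high:
        return "newborn"       # likely created by the MLP or attention
    return "unexplained"       # similarity scores alone can't account for it
```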
Guest: Okay, that makes sense. So, they're using the cosine similarity scores to categorize each feature based on its relationship to the previous layer and to the MLP and attention modules. What's the reasoning for the low s(R) but high s(M) or s(A) case?
Host: The reasoning is that a low s(R) means the feature is not strongly related to the previous layer's residual stream, while a high s(M) or s(A) means it is strongly related to the MLP or attention output. That suggests the feature wasn't present in the previous layer but was created, or significantly modified, by the MLP or attention module in the current layer. In other words, the feature may be newborn: the model is creating a new feature at that point.