Tom Lieberum

Research Engineer at DeepMind, focused on mechanistic interpretability and large language models. Opinions are my own.

Yeah I agree with that. But there is also a sense in which some (many?) features will be inherently sparse.

  • A token is either the first one of multi-token word or it isn't.
  • A word is either a noun, a verb or something else.
  • A word belongs to language LANG and not to any other language (or it has different meanings in those languages).
  • An image can only contain so many objects, which can only contain so many sub-aspects.

I don't know what it would mean to go "out of distribution" in any of these cases.

This means that any network that has an incentive to conserve parameter usage (however we want to define that), might want to use superposition.

Do superposition features actually seem to work like this in practice in current networks? I was not aware of this.

I'm not aware of any work that identifies superposition in exactly this way in NNs of practical use. 
As Spencer notes, you can verify that it does appear in certain toy settings though. Anthropic notes in their SoLU paper that they view their results as evidence for the superposition hypothesis (SPH) in LLMs. Imo the key part of the evidence here is that using a SoLU destroys performance but adding another LayerNorm afterwards solves that issue. The SoLU selects strongly against superposition and LayerNorm makes it possible again, which is some evidence that the way the LLM got to its performance was via superposition.


ETA: Ofc there could be some other mediating factor, too.

This example is meant to only illustrate how one could achieve this encoding. It's not how an actual autoencoder would work. An actual NN might not even use superposition for the data I described and it might need some other setup to elicit this behavior.
But to me it sounded like you think superposition is nothing but the network being confused, whereas I think it can be the correct way to still reconstruct the features to a reasonable degree.

Ah, I might have misunderstood your original point then, sorry! 

I'm not sure what you mean by "basis" then. How strictly are you using this term?

I imagine you are basically going down the "features as elementary unit" route proposed in Circuits (although you might not be pre-disposed to assume features are the elementary unit). Finding the set of features used by the network and figuring out how it's using them in its computations does not translate 1-to-1 to "find the basis the network is thinking in" in my mind.

Possibly the source of our disagreement here is that you are imagining the neuron ought to be strictly monotonically increasing in activation relative to the dog-headedness of the image?

If we abandon that assumption then it is relatively clear how to encode two numbers in 1D. Let's assume we observe two numbers $X, Y$. With probability $p$, $X = 0$ and $Y \sim \mathcal{N}(0, 1)$, and with probability $1 - p$, $Y = 0$ and $X \sim \mathcal{N}(0, 1)$.

We now want to encode these two events in some third variable $Z$, such that we can perfectly reconstruct $X, Y$ with probability $\approx 1$.

I put the solution behind a spoiler for anyone wanting to try it on their own.

Choose some veeeery large $B$ (much greater than the variance of the normal distribution of the features). For the first event, set $Z = Y - B$. For the second event, set $Z = X + B$.

The decoding works as follows:

If $Z$ is negative, then with probability $\approx 1$ we are in the first scenario and we can set $X = 0$, $Y = Z + B$. Vice versa if $Z$ is positive.
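
To make the scheme concrete, here is a minimal sketch of the encoding and decoding in Python. The names `x`, `y`, `z`, `B`, `encode` and `decode` are my own labels for illustration, and I assume unit-variance normal features with exactly one of the two numbers being non-zero per sample. It only illustrates the trick; it is not how a trained network would implement it.

```python
import numpy as np

B = 1_000.0  # assumed offset, far larger than the unit std-dev of the features


def encode(x, y):
    """Pack (x, y) into a single number, assuming exactly one of them is zero."""
    if x == 0.0:
        return y - B  # first event: only y is active, code lands near -B
    return x + B      # second event: only x is active, code lands near +B


def decode(z):
    """Recover (x, y) from the sign of the code."""
    if z < 0:
        return 0.0, z + B  # negative code: first event
    return z - B, 0.0      # positive code: second event


rng = np.random.default_rng(0)
p = 0.5  # assumed probability of the first event
errors = 0
for _ in range(10_000):
    v = rng.normal()
    x, y = (0.0, v) if rng.random() < p else (v, 0.0)
    x_hat, y_hat = decode(encode(x, y))
    errors += not (np.isclose(x_hat, x) and np.isclose(y_hat, y))

print("reconstruction errors:", errors)
```

Decoding only fails if a feature value is larger in magnitude than `B`, which for a standard normal and `B = 1000` essentially never happens. Note that the code is not monotonic in either feature, which is the point of dropping that assumption.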

I'd say that there is a basis the network is thinking in in this hypothetical, it just so happens to not match the human abstraction set for thinking about the problem in question.

Well, yes but the number of basis elements that make that basis human interpretable could theoretically be exponential in the number of neurons.

If due to superposition, it proves advantageous to the AI to have a single feature that kind of does dog-head-detection and kind of does car-front-detection, because dog heads and car fronts don't show up in the training data at the same time, so it can still get perfect loss through a properly constructed dual-purpose feature like this, it'd mean that to the AI, dog heads and car fronts are "the same thing".

I don't think that's true. Imagine a toy scenario of two features that run through a 1D non-linear bottleneck before being reconstructed. Assuming that with some weight settings you can get superposition, the model is able to reconstruct the features ≈perfectly as long as they don't appear together. That means the model can still differentiate the two features, they are different in the model's ontology.

As AIs get more capable and general, I'd expect the concepts/features they use to start more closely matching the ones humans use in many domains.

My intuition disagrees here too. Whether we will observe superposition is a function of (number of "useful" features in the data), (sparsity of said features), and something like (bottleneck size). It's possible that bottleneck size will never be enough to compensate for number of features. Also it seems reasonable to me that ≈all of reality is extremely sparse in features, which presumably favors superposition.

I agree that all is not lost wrt sparsity and if SPH turns out to be true it might help us disentangle the superimposed features to better understand what is going on. You could think of constructing an "expanded" view of a neural network. The expanded view would allocate one neuron per feature and thus has sparse activations for any given data point and would be easier to reason about. That seems impractical in reality, since the cost of constructing this view might in theory be exponential, as there are exponentially many "almost orthogonal" vectors for a given vector space dimension, as a function of the dimension.
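
As a rough numerical illustration of the "almost orthogonal" point, the sketch below samples many more random unit vectors than there are dimensions and reports the largest absolute cosine similarity between any two of them. The function name `max_abs_cosine` and the chosen sizes are my own assumptions. This only shows that far more than `dim` nearly orthogonal directions fit into a space; it does not demonstrate the exponential scaling itself.

```python
import numpy as np

rng = np.random.default_rng(0)


def max_abs_cosine(n_vectors, dim):
    """Largest |cosine similarity| between any two of n_vectors random unit vectors."""
    v = rng.normal(size=(n_vectors, dim))
    v /= np.linalg.norm(v, axis=1, keepdims=True)  # normalise each row
    cos = v @ v.T                                  # pairwise cosine similarities
    np.fill_diagonal(cos, 0.0)                     # ignore self-similarity
    return float(np.abs(cos).max())


for dim in (16, 64, 256, 1024):
    print(dim, round(max_abs_cosine(2000, dim), 3))
```

The worst-case overlap should shrink as the dimension grows, even though the number of vectors stays fixed at 2000 and exceeds every dimension tried here.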

I think my original comment was meant more as a caution against the specific approach of "find an interpretable basis in activation space", since that might be futile, rather than a caution against all attempts at finding a sparse representation of the computations that are happening within the network.

I don't think there is anything on that front other than the paragraphs in the SoLU paper. I alluded to a possible experiment for this on Twitter in response to that paper but haven't had the time to try it out myself: You could take a tiny autoencoder to reconstruct some artificially generated data where you vary attributes such as sparsity, ratio of input dimensions vs. bottleneck dimensions, etc. You could then look at the weight matrices of the autoencoder to figure out how it's embedding the features in the bottleneck and which settings lead to superposition, if any.
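
A minimal sketch of what such an experiment could look like, using plain NumPy. Everything specific here is an assumption on my part rather than a tested recipe: the tied-weight architecture `ReLU(x Wᵀ W + b)`, the uniform feature values, the hyperparameters, and all function names. The idea is just to vary `p_active` and `bottleneck`, then inspect `W.T @ W` to see how the features are embedded.

```python
import numpy as np

rng = np.random.default_rng(0)


def sample_batch(batch, n_features, p_active):
    """Each feature is independently active with probability p_active, value in [0, 1)."""
    active = rng.random((batch, n_features)) < p_active
    return active * rng.random((batch, n_features))


def forward(W, b, x):
    h = x @ W.T                      # (batch, bottleneck)
    pre = h @ W + b                  # (batch, n_features)
    return h, pre, np.maximum(pre, 0.0)


def loss_fn(W, b, x):
    """Squared reconstruction error, summed over features, averaged over the batch."""
    _, _, out = forward(W, b, x)
    return float(((out - x) ** 2).sum(axis=1).mean())


def train(n_features=5, bottleneck=2, p_active=0.1, steps=3000, batch=256, lr=0.2):
    W = 0.1 * rng.normal(size=(bottleneck, n_features))
    b = np.zeros(n_features)
    for _ in range(steps):
        x = sample_batch(batch, n_features, p_active)
        h, pre, out = forward(W, b, x)
        d_pre = 2.0 * (out - x) / batch * (pre > 0)  # gradient through the ReLU
        d_h = d_pre @ W.T                            # gradient into the bottleneck
        grad_W = h.T @ d_pre + d_h.T @ x             # W is used twice (tied weights)
        W -= lr * grad_W
        b -= lr * d_pre.sum(axis=0)
    return W, b


W, b = train()
x_eval = sample_batch(4096, 5, 0.1)
final_loss = loss_fn(W, b, x_eval)
zero_baseline = float((x_eval ** 2).sum(axis=1).mean())  # loss of always predicting 0
print("loss:", final_loss, "baseline:", zero_baseline)
print(np.round(W.T @ W, 2))  # feature-by-feature Gram matrix
```

In the Gram matrix, diagonal entries near 1 indicate features the model chose to represent, and non-zero off-diagonal entries indicate features that share directions in the bottleneck. If more features are represented than there are bottleneck dimensions, that would be a sign of superposition in this toy setting.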

I disagree with your intuition that we should expect networks at irreducible loss to not be in superposition.

The reason I brought this up is that there are, IMO, strong first-principles reasons for why SPH should be correct. Say there are two features, each of which has an independent probability of 0.05 of being present in a given data point; then it would be wasteful to allocate a full neuron to each of these features. The probability of both features being present at the same time is a mere 0.0025. If the superposition is implemented well you get basically two features for the price of one with an error rate of 0.25%. So if there is even a slight pressure towards compression, e.g. by having fewer available neurons than features, then superposition should be favored by the network.
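
The arithmetic is quick to check: for two independent features with presence probability 0.05 each, the collision probability is 0.05 × 0.05 = 0.0025, i.e. 0.25%. The sketch below computes this and compares it against a simulation; the sample size and variable names are arbitrary choices of mine.

```python
import numpy as np

p = 0.05            # probability that each feature is present
p_both = p * p      # independent features: probability that both are present
print("analytic collision probability:", p_both)

rng = np.random.default_rng(0)
n = 1_000_000
a = rng.random(n) < p   # presence of feature A per data point
b = rng.random(n) < p   # presence of feature B per data point
empirical = float((a & b).mean())
print("empirical collision rate:", empirical)
```

If the features were anti-correlated rather than independent, the collision rate would be lower still.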

Now does this toy scenario map to reality? I think it does, and in some sense it is even more favorable to SPH since often the presence of features will be anti-correlated. 
