johnswentworth

One can argue that the goal-aligned model has an incentive to preserve its goals, which would result in an aligned model after SLT. Since preserving alignment during SLT is largely outsourced to the model itself, arguments for alignment techniques failing during an SLT don't imply that the plan fails...

I think this misses the main failure mode of a sharp left turn. The problem is not that the system abandons its old goals and adopts new goals during a sharp left turn. The problem is that the old goals do not generalize in the way we humans would prefer, as capabilities ramp up. The model keeps pursuing those same old goals, but stops doing what we want because the things we wanted were never optimal for the old goals in the first place. Outsourcing goal-preservation to the model should be fine once capabilities are reasonably strong, but goal-preservation isn't actually the main problem which needs to be solved here.

(Or perhaps you're intentionally ignoring that problem by assuming "goal-alignment"?)

First: what's the load-bearing function of visualizations in math?

I think it's the same function as prototypical examples more broadly. They serve as a consistency check - i.e. if there's any example at all which matches the math then at least the math isn't inconsistent. They also offer direct intuition for which of the assumptions are typically "slack" vs "taut" - i.e. in the context of the example, would the claim just totally fall apart if we relax a particular assumption, or would it gracefully degrade? And they give some intuition for what kinds-of-things to bind the mathematical symbols to, in order to apply the math.

Based on that, I'd expect that non-visual prototypical examples can often serve a similar role.

Also, some people use type-tracking to get some of the same benefits, though insofar as that's a substitute for prototypical example tracking I think it's usually inferior.

I've been trying to push against the tendency for everyone to talk about FTX drama lately, but I have some generalizable points on the topic which I haven't seen anybody else make, so here they are. (Be warned that I may just ignore responses; I don't really want to dump energy into FTX drama.)

Summary: based on having worked in startups a fair bit, Sam Bankman-Fried's description of what happened sounds probably accurate; I think he mostly wasn't lying. I think other people do not really get the extent to which fast-growing companies are hectic and chaotic and full of sketchy quick-and-dirty workarounds and nobody has a comprehensive view of what's going on.

Long version: at this point, the assumption/consensus among most people I hear from seems to be that FTX committed intentional, outright fraud. And my current best guess is that that's mostly false. (Maybe in the very last couple weeks before the collapse they crossed the line into outright lies as a desperation measure, but even then I think they were in pretty grey territory.)

Key pieces of the story as I currently understand it:

• Moving money into/out of crypto exchanges is a pain. At some point a quick-and-dirty solution was for customers to send money to Alameda (Sam Bankman-Fried's crypto hedge fund), and then Alameda would credit them somehow on FTX.
• Customers did rather a lot of that. Like, $8B worth.
• The FTX/Alameda team weren't paying attention to those particular liabilities; they got lost in the shuffle.
• At some point in the weeks before the collapse, when FTX was already under moderate financial strain, somebody noticed the $8B liability sitting around. And that took them from "moderate strain" to "implode".

How this contrasts with what seems-to-me to be the "standard story": most people seem to assume that it is just totally implausible to accidentally lose track of an $8B liability. Especially when the liability was already generated via the decidedly questionable practice of routing customer funds for the exchange through a hedge fund owned by the same people. And therefore it must have been intentional - in particular, most people seem to think the liability was intentionally hidden.

I think the main reason I disagree with others on this is that I've worked at a startup. About 5 startups, in fact, over the course of about 5 years.

The story where there was a quick-and-dirty solution (which was definitely sketchy but not ill-intentioned), and then stuff got lost in the shuffle, and then one day it turns out that there's a giant unanticipated liability on the balance sheet... that's exactly how things go, all the time. I personally was at a startup which had to undergo a firesale because the accounting overlooked something. And I've certainly done plenty of sketchy-but-not-ill-intentioned things at startups, as quick-and-dirty solutions. The story that SBF told about what happened sounds like exactly the sort of things I've seen happen at startups many times before.

On my current understanding, this is true but more general; the natural abstraction hypothesis makes narrower predictions than that.

This is basically correct, other than the part about not having any guarantee that the information is in a nice format. The Maxent and Abstractions arguments do point toward a relatively nice format, though it's not yet clear what the right way is to bind the variables of those arguments to stuff in a neural net. (Though I expect the data structures actually used will have additional structure to them on top of the maxent form.)

Meta: I'm going through a backlog of comments I never got around to answering. Sorry it took three months.

I've assumed it would be possible to reweight things to focus on a better distribution of data points, because it seems like there would be some very mathematically natural ways of doing this reweighting. Is this something you've experimented with?

Something along those lines might work; I didn't spend much time on it before moving to a generative model.

When you say "directly applied", what do you mean?

The actual main thing I did was to compute the SVD of the Jacobian of a generative network's output (i.e. the image) with respect to its input (i.e. the latent vector). Results of interest:

• Conceptually, near-0 singular values indicate a direction-in-image-space in which no latent parameter change will move the image - i.e. locally-inaccessible directions. Conversely, large singular values indicate "degrees of freedom" in the image. Relevant result: if I take two different trained generative nets, and find latents for each such that they both output approximately the same image, then they both roughly agree on what directions-in-image-space are local degrees of freedom.
• By taking the SVD of the Jacobian of a chunk of the image with respect to the latent, we can figure out which directions-in-latent-space that chunk of image is locally sensitive to. And then, a rough local version of the natural abstraction hypothesis would say that nonadjacent chunks of image should strongly depend on the same small number of directions-in-latent-space, and be "locally independent" (i.e. not highly sensitive to the same directions-in-latent-space) given those few. And that was basically correct.
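A minimal numpy sketch of the measurement in the first bullet. The "generator" here is a random toy stand-in (a two-layer tanh map), not a trained net, and all names are illustrative; the point is just the mechanics: Jacobian of output w.r.t. latent, then SVD, then read off accessible vs. inaccessible directions-in-image-space.

```python
import numpy as np

# Toy stand-in for a trained generator: maps a 4-dim latent to a 16-dim
# "image". A real experiment would use an actual generative net.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(16, 8))

def generator(z):
    return W2 @ np.tanh(W1 @ z)

def jacobian(f, z, eps=1e-6):
    """Finite-difference Jacobian of f at z (autograd would do for a real net)."""
    f0 = f(z)
    J = np.zeros((f0.size, z.size))
    for i in range(z.size):
        dz = np.zeros_like(z)
        dz[i] = eps
        J[:, i] = (f(z + dz) - f0) / eps
    return J

z = rng.normal(size=4)
J = jacobian(generator, z)                      # shape (16, 4)
U, S, Vt = np.linalg.svd(J, full_matrices=True)

# Large singular values mark local "degrees of freedom" of the image.
# Columns of U beyond the rank of J span locally inaccessible
# directions-in-image-space: no small latent change moves the image along them.
accessible = U[:, :4]     # image-space directions the latent can locally move
inaccessible = U[:, 4:]   # directions the generator can't reach locally
print(S)
```

To replicate the cross-model comparison, one would run this for two trained nets at latents producing (approximately) the same image, then compare the subspaces spanned by the large-singular-value columns of each net's `U`.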

To be clear, this was all "rough heuristic testing", not really testing predictions carefully derived from the natural abstraction framework.

I think these flaws point toward the idea that, when we do interpretability, what we really want is to impose some structure on the network. That is, we want to find some set of conditions that could occur in reality, such that we know that if those conditions occur, the network satisfies some useful property (such as "usually classifies things correctly").

The main difficulty with this is, it requires a really good understanding of reality?

There we go!

So, one item on my list of posts to maybe get around to writing at some point is about what's missing from current work on interpretability, what bottlenecks would need to be addressed to get the kind of interpretability we ideally want for application to alignment, and how True Names in general and natural abstraction specifically fit into the picture.

The OP got about half the picture: current methods mostly don't have a good ground truth. People use toy environments to work around that, but then we don't know how well tools will generalize to real-world structures which are certainly more complex and might even be differently complex.

The other half of the picture is: what would a good ground truth for interpretability even look like? And as you say, the answer involves a really good understanding of reality.

Unpacking a bit more: "interpret" is a two-part word. We see a bunch of floating-point numbers in a net, and we interpret them as an inner optimizer, or we interpret them as a representation of a car, or we interpret them as fourier components of some signal, or .... Claim: the ground truth for an interpretability method is a True Name of whatever we're interpreting the floating-point numbers as. The ground truth for an interpretability method which looks for inner optimizers is, roughly speaking, a True Name of inner optimization. The ground truth for an interpretability method which looks for representations of cars is, roughly speaking, a True Name of cars (which presumably routes through some version of natural abstraction). The reason we have good ground truths for interpretability in various toy problems is because we already know the True Names of all the key things involved in those toy problems - like e.g. modular addition and Fourier components.

The "most" was doing key work in that sentence you quoted.

I totally buy that antibiotic resistance is a large and growing problem. The part which seems like obvious bullshit is the claim that the cost outweighs the benefit, or is even remotely on the same order of magnitude, especially when we're talking about an area like sub-Saharan Africa. Do any of those studies have a cost-benefit analysis?

(Also, side note: antibiotic resistance is totally in the news regularly. Here's one from yesterday.)

To be clear, I don't think the claim that self-medicated antibiotic use causes more antibiotic resistance is obvious bullshit. Maybe the effect size is close to zero outside of hospitals, maybe it's not, but the claim isn't obvious bullshit either way.

The "obvious bullshit" part is the (implicit) claim that the cost outweighs the benefit, or is even remotely on the same order of magnitude, especially when we're talking about an area where the alternative is usually "don't use antibiotics at all".

Object level comment: Antibiotic resistance is bad, this is likely to make it worse, probably without saving lives. You probably shouldn't self-medicate with antibiotics, you definitely shouldn't give them to others without knowing more about medical diagnosis.

I've certainly heard arguments along those lines before. They seem like obvious bullshit. Evidence: in most of the world, antibiotics are readily available over-the-counter, and yet I don't hear about most of the world's human-infecting bacteria becoming antibiotic resistant. Most of the world continues to use antibiotics as self-medication, and year after year they keep mostly working.

It seems to me like a very close analogue of Oregon and New Jersey's laws about pumping your own gas. Neither of those states allows it, which the rest of us know is completely stupid, but there's still somehow a debate about it because lots of people make up reasons why it would be very dangerous to allow people to pump their own gas.