Ege Erdil

If you have any questions for me or just want to talk, feel free to reach out by sending a private message on this site or by sending an e-mail to egeerdil96@gmail.com.

You can also find me on Metaculus at https://www.metaculus.com/accounts/profile/116023/, or on Discord using the tag Starfall#7651.

Comments

I don't think this is the case, because actually the claim "paracetamol helps with fever" is fairly underdetermined.

This meta-analysis says that paracetamol is effective at reducing fever in the short term, but doesn't seem to accelerate fever clearance in a statistically significant way. So it depends on what you mean by "reducing fever", since short-term impact on fever is usually considered a secondary outcome in these studies.

It's true that the mechanism of action is not well understood. However, depending on exactly how you interpret the claim above, it might be true that there is "scientific evidence" for it, or it might not be. Ditto for a claim such as "racial group differences in IQ tests are genetic". What exactly are we talking about when we make this claim?

What's "scientific evidence"? Is there scientific evidence that paracetamol helps with fever, for example?

This looks correct to me - this is indeed how the model is able to disentangle algorithmic progress from scaling of training compute budgets.

The problems you mention are even more extreme with dataset size because plenty of the models in our analysis were only trained on ImageNet-1k, which has around 1M images. So more than half of the models in our dataset actually just use the exact same training set, which makes our model highly uncertain about the dataset side of things.

In addition, the way people typically incorporate extra data is by pretraining on bigger, more diverse datasets and then fine-tuning on ImageNet-1k. This is obviously different from sampling more images from the training distribution of ImageNet-1k, though bigger datasets such as ImageNet-21k are constructed on purpose to be similar in distribution to ImageNet-1k. We actually tried to take this into account explicitly by introducing some kind of transfer exponent between different datasets, but this didn't really work better than our existing model.
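For concreteness, here is one simple way such a transfer term could be parameterized. This is only an illustrative sketch, not the specification we actually experimented with; the function `effective_dataset_size`, the exact functional form, and the example numbers are made up for exposition.

```python
# Illustrative sketch only: one way to parameterize a "transfer exponent"
# that converts pretraining data into ImageNet-1k-equivalent data.
def effective_dataset_size(n_imagenet: float, n_pretrain: float,
                           transfer_exponent: float) -> float:
    """Count the pretraining set as n_pretrain ** transfer_exponent
    "ImageNet-equivalent" images: an exponent of 1 means the extra data is
    as useful as in-distribution data, smaller exponents discount it."""
    return n_imagenet + n_pretrain ** transfer_exponent

# e.g. ImageNet-21k pretraining (roughly 14M images) with a hypothetical
# transfer exponent of 0.8
print(effective_dataset_size(1.28e6, 1.4e7, 0.8))
```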

One final wrinkle is the irreducible loss of ImageNet. I tried to get some handle on this by reading the literature, and I would estimate a lower bound of maybe 1-2% on the irreducible top-1 error, as this seems to be the fraction of images that have incorrect labels. There's a bigger fraction of images that could plausibly fit multiple categories at once, but models seem to be able to do substantially better than chance on these examples, and it's not clear when we can expect this progress to cap out.

Our model specification assumes that in the infinite compute and infinite data limit you reach 100% accuracy. This is probably not exactly right because of irreducible loss, but since current models already reach around 90% top-1 accuracy, I think it's probably not too big of a problem for within-distribution inference, e.g. answering questions such as "how much software progress did we see over the past decade?". Out-of-distribution inference is a totally different game, and I would not trust our model with it for a variety of reasons; the biggest one is really the lack of diversity and the limited size of our dataset, and it doesn't have much to do with our choice of model.
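To illustrate what a specification with this property looks like, here is a toy functional form, not the actual model from our paper: the function `top1_accuracy`, the 2012 reference year, and all constants are made-up illustrative choices. Error decays as a power law in algorithm-adjusted compute and in data, with no irreducible floor, so accuracy approaches 100% only in the joint infinite-compute, infinite-data limit.

```python
# Toy functional form, not the actual model from the paper: top-1 error decays
# as a power law in algorithm-adjusted compute and in data, with no irreducible
# floor, so accuracy -> 100% in the joint infinite-compute, infinite-data limit.
def top1_accuracy(compute: float, data: float, year: float,
                  progress_rate: float = 0.5, alpha: float = 0.3,
                  beta: float = 0.3, c0: float = 1.0, d0: float = 1.0) -> float:
    """All constants are made-up illustrative values. `progress_rate` is the
    assumed number of effective-compute doublings per year of algorithmic
    progress, measured relative to 2012."""
    effective_compute = compute * 2.0 ** (progress_rate * (year - 2012))
    error = (c0 / effective_compute) ** alpha + (d0 / data) ** beta
    return 1.0 - min(error, 1.0)
```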

To be honest, I think ImageNet-1k is just a bad benchmark for evaluating computer vision models. The reason we have to use it here is that the benchmarks that correlate better with real-world use cases were developed only recently, and we have no data on how past models perform on them. When we were starting this investigation we had to trade off benchmark quality against the size and diversity of our dataset, and we ended up going with ImageNet-1k top-1 accuracy for this reason. With better data on superior benchmarks, we would not have made this choice.

I think this question is interesting but difficult to answer based on the data we have, because the dataset is so poor when it comes to unusual examples that would really allow us to answer this question with confidence. Our model assumes that they are substitutes, but that's not based on anything we infer from the data.

Our model is certainly not exactly correct, in the sense that there should be some complementarity between compute and algorithms, but the complementarity probably only becomes noticeable for extreme ratios between the two contributions. One way to think about this is that we can approximate a CES production function

$$F(C, A) = \left(\alpha C^{\rho} + (1 - \alpha) A^{\rho}\right)^{1/\rho}$$

in training compute $C$ and algorithmic efficiency $A$ when $C \approx A$ by writing it as

$$F(C, A) \approx C^{\alpha} A^{1 - \alpha},$$

which means the first-order behavior of the function around $C \approx A$ doesn't depend on $\rho$, which is the parameter that controls complementarity versus substitutability. Since people empirically seem to train models in the regime where $C$ is close to $A$, this makes $\rho$ difficult to identify from the data we have, and approximating $F$ by a Cobb-Douglas (which is what we do) does about as well as anything else. For this reason, I would caution against using our model to predict the performance of models that have an unusual combination of dataset size, training compute, and algorithmic efficiency.
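A small numerical check makes the identification problem concrete. The code below is only an illustration, not from our paper; the share parameter $\alpha = 0.5$ and the grid of compute values are arbitrary choices. Near the balanced regime $C \approx A$, CES aggregates with very different values of $\rho$ are almost indistinguishable from the Cobb-Douglas approximation.

```python
# Illustration only: near C ~ A, CES aggregates with very different rho are
# nearly indistinguishable from the Cobb-Douglas (rho -> 0) approximation.
import numpy as np

alpha = 0.5                               # arbitrary share parameter
A = 1.0                                   # algorithmic efficiency, held fixed
C = np.exp(np.linspace(-0.2, 0.2, 41))    # compute varying slightly around C = A

def log_ces(C, A, rho):
    """log of (alpha * C^rho + (1 - alpha) * A^rho)^(1 / rho)."""
    return np.log(alpha * C ** rho + (1 - alpha) * A ** rho) / rho

def log_cobb_douglas(C, A):
    """log of C^alpha * A^(1 - alpha), the rho -> 0 limit of the CES form."""
    return alpha * np.log(C) + (1 - alpha) * np.log(A)

for rho in [-1.0, -0.1, 0.1, 1.0]:
    gap = np.max(np.abs(log_ces(C, A, rho) - log_cobb_douglas(C, A)))
    print(f"rho = {rho:+.1f}: max deviation from Cobb-Douglas = {gap:.5f}")
# The deviations are second order in log(C / A), so data from this regime
# cannot pin down rho.
```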

In general, a more diverse dataset containing models trained with unusual values of compute and data for the year that they were trained in would help our analysis substantially. There are some problems with doing this experiment ourselves: for instance, techniques used to train larger models often perform worse than older methods if we try to scale them down. So there isn't much drive to make algorithms run really well with small compute and data budgets, and that's going to bias us towards thinking we're more bottlenecked by compute and data than we actually are.

I would guess that making progress on AGI would be slower. Here are two reasons I think are particularly important:

  1. ImageNet accuracy is a metric that can in many ways be gamed, so you can make progress on ImageNet that doesn't transfer to more general image classification tasks. As an example of this, in this paper the authors conduct experiments which confirm that adversarially robust training on ImageNet degrades ImageNet test or validation accuracy, but robustly trained models generalize better to classification tasks on more diverse datasets when fine-tuned on them.

    This indicates that a lot of the progress on ImageNet is actually "overlearning": it doesn't generalize in a useful way to tasks we actually care about in the real world. There's good reason to believe that part of this overlearning would show up as algorithmic progress in our framework, since people can adapt their models better to ImageNet even without extra compute or data.

  2. Researchers have stronger feedback loops on ImageNet: they can try something directly on the benchmark they care about, see the results, and immediately update on their findings. This allows them to iterate much faster, and iteration is a crucial component of progress in any engineering problem. In contrast, our iteration loops towards AGI operate at considerably lower frequencies. This point is also made by Ajeya Cotra in her biological anchors report, and it's why she chooses to cut the software progress speed estimates from Hernandez and Brown (2020) in half when computing her AGI timelines.

    Such an adjustment seems warranted here, but I think the way Cotra does it is not very principled and certainly doesn't do justice to the importance of the question of software progress.

Overall I agree with your point that training AGI is a different kind of task. I would be more optimistic about progress in a very broad domain such as computer vision or natural language processing translating to progress towards AGI, but I suspect the conversion will still be significantly less favorable than any explicit performance metric would suggest. I would not recommend using point estimates of software progress on the order of a doubling of compute efficiency per year for forecasting timelines.

You would do it in log space (or geometrically). For your example, the answer would be .
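To spell out the mechanics with made-up numbers (the values from the example above aren't reproduced here), averaging multiplicative factors in log space is the same as taking their geometric mean:

```python
# Made-up numbers for illustration: averaging multiplicative factors in log
# space, i.e. taking their geometric mean.
import math

factors = [2.0, 8.0]  # e.g. two estimates of an efficiency gain
geometric_mean = math.exp(sum(math.log(f) for f in factors) / len(factors))
print(geometric_mean)  # 4.0, whereas the arithmetic mean would give 5.0
```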

Should society eliminate schools?

The question is too vague as it's stated, but I think society should eliminate schools in their present form. This is a rather worthless statement though, at least unless it's fleshed out by a reasonably detailed description of what that alternative world would look like.

I think it would be a substantial win to at least cut down the years of schooling on the margin and replace them with work and/or apprenticeships whenever possible. An uncontroversial example: the fact that physicians and lawyers in the US have to complete a whole separate undergraduate degree before going to medical school or law school seems like a colossal waste of time and resources, and many civilized places in the world get by just fine without this extension.

So on the margin, I think it's good to move in the direction of "eliminating schools". Whether you want to go all the way and what happens if you do is more complicated, though I think there are definitely more promising alternative systems that would qualify. These are more speculative and only of theoretical interest given where we currently are as a society, though.

Should we have more compulsory schooling?

On the margin, I don't see how more compulsory schooling would help with anything useful, and the costs are significant, even aside from the moral concerns with forcing children to go to school et cetera. So the answer here looks fairly overdetermined to be "no" unless marginal years of schooling are shown to have substantial benefits.

Should you send your kids to school?

Depends on the situation. Do the kids want to go to school? Do you think the careers that would be the best fit for them require going through some formal accreditation process that involves schooling? How feasible is it for you to arrange an alternative to school for the purposes that matter, and what are the costs of not participating in the existing system?

I would put significant weight on the preference of the kids in question here, and I can easily imagine that some of them want to go to school and others don't. A "one size fits all" policy seems inappropriate here.

Should you prefer to hire job candidates who have received more schooling, beyond school's correlation with the g factor?

There are other reasons to prefer such candidates, but it depends on exactly which job you're hiring for. People who are "competent" despite not having gone to school are, right now, all highly unusual in various ways, and they might be unusual in ways that make them poor fits for the specific job you have in mind. In that case, having gone to school would be a valuable signal above and beyond its correlation with g.

Should we consider the spread of education requirements to be a form of class war by the better-educated against the worse-educated which must be opposed for the sake of the worse-educated and the future of society?

Probably not. I don't see what reason there is to invent such an explanation for the phenomenon of schooling, or what predictive power or utility it would have.

I find it more productive to view schooling and its shortcomings (like many other things) as coordination failures and problems imposed by scarcity rather than as any kind of "class war" waged by one group against another. Useful thinking about these questions should contend with the coordination issues surrounding signaling and the like, and with the substantial opportunity cost of putting high-quality teachers in too many classrooms.

I wanted to thank you for writing this comment - while I have also been reasonably active on social media about this topic, and playing level 3+ games is sometimes necessary in the real world, I don't think this post actually offers any substantive content that goes beyond "fraud is bad and FTX was involved in fraudulent activities".

I agree that it's not a good fit for LW, though I think the post does fit in the EA Forum given recent events.

In my opinion, the fact that the original tweet, along with Bankman-Fried's other tweets in the same thread, has been deleted in an attempt to shift the narrative is noteworthy in and of itself.

I think the tweet is more likely to have been deleted in this way in worlds where the claim in it was made in bad faith than in worlds where it wasn't, though perhaps unknown details about FTX's strategy for getting out of this quagmire necessitated such a move even if the claim had originally been made in good faith. Still, the optics are not good, and I have updated accordingly towards FTX having engaged in dishonest conduct.

I agree with evhub that we don't yet know if the claim was made in bad faith, but we can still make some guesses in the absence of such definitive knowledge.
