Sequences

Linguistic Freedom: Map and Territory Revisited
INVESTIGATIONS INTO INFINITY

Comments

GPT-J token embeddings inhabit a zone in their 4096-dimensional embedding space formed by the intersection of two hyperspherical shells


You may want to update the TLDR if you agree with the comments that indicate that this might not be accurate.

If there are 100 tokens for snow, it probably indicates that it's a particularly important concept for that language.

For Linear Tomography and Principal Component Analysis, I'm assuming that by "unsupervised" you mean that you don't use the labels to find the vector, but that you do use them to determine which sign corresponds to true and which to false. If so, this might be worth clarifying in the table.
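
Concretely, here's the kind of thing I have in mind (my own sketch, not code from the post; the function name and array shapes are just illustrative):

```python
import numpy as np

def truth_direction_pca(activations, labels):
    """Find a candidate direction without labels, then use the labels
    only to orient it (i.e. decide which end is 'true').

    activations: (n_examples, d) array of hidden states
    labels:      (n_examples,) array of 0/1 truth values
    """
    # Unsupervised step: top principal component of the centred activations.
    centred = activations - activations.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    direction = vt[0]

    # Supervised step (labels used ONLY here): flip the sign so that
    # true examples project to larger values than false ones.
    projections = centred @ direction
    if projections[labels == 1].mean() < projections[labels == 0].mean():
        direction = -direction

    return direction
```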

Agreed. Good counter-example.

I'm very curious as to whether Zac has a way of reformulating his claim to save it.

Well done for writing this up! Admissions like this are often hard to write.

Have you considered trying to use any credibility from helping to co-found Vast for public outreach purposes?

Does GPT-4 directly handle the image input or is it converted to text by a separate model then fed into GPT-4?

A potential approach to tackling this could be to discover features in smaller batches. After each batch of discovered features finishes learning, we could freeze them and only compute the orthogonality regularisation within the next batch, as well as between the next batch and the frozen features. Importantly, we wouldn't need to apply the regularisation within the already discovered features.
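
Roughly, something like this (a rough sketch of the idea; names and tensor shapes are illustrative, not from the post):

```python
import torch

def batched_orthogonality_penalty(frozen_feats, new_feats):
    """Orthogonality penalty for a batch of newly learned feature directions.

    Penalises (i) overlap among the new features and (ii) overlap between
    the new features and the already-frozen ones, but NOT overlap within
    the frozen set, since those pairs were handled when their batch trained.

    frozen_feats: (n_frozen, d) tensor, treated as fixed
    new_feats:    (n_new, d) tensor being optimised
    """
    new = torch.nn.functional.normalize(new_feats, dim=-1)

    # Overlap within the new batch (exclude the diagonal self-similarity).
    within = new @ new.T
    eye = torch.eye(len(new), device=new.device)
    within_penalty = (within - eye).pow(2).sum()

    # Overlap between the new batch and all previously frozen features.
    if len(frozen_feats) > 0:
        frozen = torch.nn.functional.normalize(frozen_feats.detach(), dim=-1)
        cross_penalty = (new @ frozen.T).pow(2).sum()
    else:
        cross_penalty = new_feats.new_zeros(())

    return within_penalty + cross_penalty
```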


Wouldn't this still be quadratic?

You state that GPT-4 is multi-modal, but my understanding was that it wasn't natively multi-modal. I thought that the extra features like images and voice input were hacked on, i.e. instead of generating an image itself, it generates a query that is sent to DALL-E. Is my understanding here incorrect?

In any case, it could just be a matter of scale. Maybe these kinds of tasks are rare enough in internet data that being able to model them doesn't improve the models' loss very much? And perhaps the instruction fine-tuning focused on more practical tasks?

"Previous post" links to localhost.
