Gerald Monroe



Right. I see this as a problem also: asking the model if it's sure injects information if we only ask on wrong answers. If we always ask, it may disturb more right answers than it fixes wrong ones.

It's also accuracy-dependent: if the model is 99 percent accurate on a subtask, asking if it's sure may degrade accuracy, while it may improve accuracy on a subtask where it's only 50 percent accurate.

Or in other words, we could prompt it and it might do better on AP English but worse on the bar exam.

From his worldview it would be like a cancer patient getting a stage 4 diagnosis.

This is an 8-13x decrease in required memory, and proportionally less compute, unless they increased convolutions a lot.

It means 18 months to 6 years of AI compute progress overnight. (18 months because compute dedicated to AI is doubling every 6 months; 6 years is about how long it takes to get 8 times the compute per dollar.)
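A back-of-envelope check of those two figures. The 6-month doubling time for AI compute is stated above; the roughly 2-year doubling time for compute per dollar is my assumption, chosen because it reconciles the "6 years" figure:

```python
import math

memory_reduction = 8  # lower bound of the claimed 8-13x decrease

doublings = math.log2(memory_reduction)  # 3 doublings for an 8x gain

# (a) compute dedicated to AI doubles every 6 months (stated above)
months_of_scaling = doublings * 6        # -> 18 months

# (b) compute per dollar doubles roughly every 2 years (my assumption)
years_of_price_performance = doublings * 2  # -> 6 years

print(months_of_scaling, years_of_price_performance)
```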

<del>Maybe meta did a sloppy job of benchmarking the model.</del>

Update: From reading the paper, they did not. They appear to have replicated the result and found that the scaling laws were even MORE favorable to more tokens than the LessWrong post claimed. They also made slight tweaks to the transformer architecture. What's notable is that it's a collaboration: the tweaks came from multiple other AI labs.

You, on the other hand, are proposing a novel training procedure, and one which (I take it) you believe holds more promise for AGI than LLM training. 

It's not really novel.  It is really just coupling together 3 ideas:

  (1) the idea of an AGI gym, which was in the GATO paper implicitly, and is currently being worked on.

  (2) Noting there are papers on network architecture search and activation function search, that SOTA architectures use multiple neural networks in a cognitive architecture, and that an AGI design is some cognitive architecture of multiple models, where no living human yet knows which architecture will work.

   So we have layers here, and the layers look a lot like each other and are frameworkable.  

     Activation functions, which are graphs of primitive math functions from the set of "all primitive functions discovered by humans"

    Network layer architectures which are graphs of (activation function, connectivity choice)

    Network architectures, which are graphs of layers.  (You can also subdivide into functional modules of multiple layers, like a column; the choice of how you subdivide can be represented as a graph choice also.)

    Cognitive architectures which are graphs of networks

And we can just represent all of this as a graph of graphs of graphs of graphs, and we want the ones that perform like an AGI.  That's why I said the overall "choice" is just a coordinate in a search space, which is just a binary string.

You could make an OpenAI Gym-wrapped "AGI designer" task.
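As a toy illustration of that "coordinate in a search space" framing, here is a minimal sketch where a nested architecture choice decodes from a fixed-width bit string. All of the menus and names here are hypothetical stand-ins, not a real design space:

```python
# Hypothetical, tiny design space: each level of the hierarchy picks
# from a 4-item menu, so each level costs 2 bits.
ACTIVATIONS = ["relu", "gelu", "tanh", "swish"]
LAYER_TYPES = ["dense", "conv", "attention", "mixture"]
WIRING      = ["serial", "residual", "branching", "recurrent"]
TOP_LEVEL   = ["single-net", "two-net-critic", "planner+executor", "ensemble"]

MENUS = [ACTIVATIONS, LAYER_TYPES, WIRING, TOP_LEVEL]

def decode(bits: str):
    """Map an 8-bit string to one concrete architecture choice per level."""
    assert len(bits) == 2 * len(MENUS)
    choices = []
    for i, menu in enumerate(MENUS):
        index = int(bits[2 * i : 2 * i + 2], 2)  # 2 bits -> menu index
        choices.append(menu[index])
    return choices

# Every 8-bit coordinate names exactly one design in the space.
print(decode("01100011"))  # -> ['gelu', 'attention', 'serial', 'ensemble']
```

A real space would be vastly larger and variable-width, but the point stands: the entire nested choice collapses to one binary string a search process can optimize over.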

  (3) Noting that LLMs seem to be perfectly capable of general tasks, as long as they are simple.  Which means we are very close to being able to RSI right now.



No lab right now has enough resources in one place to attempt the above, because it requires training many instances of systems larger than current max-size LLMs (you need multiple networks in a cognitive architecture) to find out what works.

They may allocate this soon enough, and there may be a more dollar-efficient way to accomplish the above that gets tried first, but you'd only need a few billion to try this...

That's exactly what I am talking about.  One divergence in our views is that you haven't carefully examined current-gen AI "code" to understand what it does.  (Note that some of my perspective comes from the fact that all AI models look similar at the layer I work at: runtime platforms.)

If you examine the few thousand lines of Python source, especially the transformer model, you will realize that, functionally, the pipeline I describe of "input, neural network, output, evaluation" is all that the source does.  You could in fact build a "general framework" that would allow you to define many AI models, almost all of which humans have never tested, without writing a single line of new code.
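A minimal sketch of what such a "general framework" could look like: the pipeline is one generic loop, and a "new model" is just a new declarative spec. The names and the toy lambdas are hypothetical placeholders, not any lab's actual code:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class ModelSpec:
    """A model defined as data: the framework supplies the loop."""
    preprocess: Callable[[Any], Any]
    network: Callable[[Any], Any]       # stand-in for any architecture
    evaluate: Callable[[Any, Any], float]

def run_pipeline(spec: ModelSpec, batch, targets) -> float:
    """The entire 'input, neural network, output, evaluation' pipeline."""
    outputs = [spec.network(spec.preprocess(x)) for x in batch]
    return spec.evaluate(outputs, targets)

# Defining a "new model" means writing a spec, not new framework code:
toy = ModelSpec(
    preprocess=lambda x: x / 255.0,
    network=lambda x: x * 2.0,  # placeholder for a real network
    evaluate=lambda out, tgt: sum((o - t) ** 2 for o, t in zip(out, tgt)) / len(tgt),
)
print(run_pipeline(toy, [255.0, 0.0], [2.0, 0.0]))  # -> 0.0 (mean squared error)
```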

So the full process is :

[1] A benchmark of many tasks.  Tasks must be autogradeable; human participants must be able to 'play' the tasks so we have a control-group score; tasks must push the edge of human cognitive ability (so the average human scores nowhere close to the max score, and the top 1% of humans do not max the bench either); and there must be many tasks with a rich permutation space (so it isn't possible for a model to memorize all permutations).

[2] A heuristic weighted score on this benchmark intended to measure how "AGI-like" a model is.  It might be the RMSE across the benchmark, but with a lot of score weighting on zero-shot, cross-domain/multimodal tasks.  That is, the kind of model that can use information from many different previous tasks on a complex exercise it has never seen before is closer to an AGI, or closer to replicating "Leonardo da Vinci", who had exceptional human performance presumably from all this cross-domain knowledge.

[3] In the computer science task set, there are tasks to design an AGI for a bench like this.  The model proposes a design, and if that design has already been tested, it immediately receives detailed feedback on how it performed.

As I mentioned, the "design an AGI" subtask can be much simpler than "write all the boilerplate in Python", but these models will be able to do that if needed.  
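The heuristic score in [2] could be sketched as a weighted RMSE where zero-shot and cross-domain tasks count extra. The function name and the weight values are hypothetical; this only illustrates the weighting idea:

```python
import math

def agi_score(results, zero_shot_weight=3.0, cross_domain_weight=2.0):
    """Weighted RMSE across the benchmark; lower is more 'AGI-like'.
    Each result is (error, is_zero_shot, is_cross_domain)."""
    total_w, total_sq = 0.0, 0.0
    for error, zero_shot, cross_domain in results:
        w = 1.0
        if zero_shot:
            w *= zero_shot_weight    # unseen tasks count extra
        if cross_domain:
            w *= cross_domain_weight # cross-domain transfer counts extra
        total_w += w
        total_sq += w * error ** 2
    return math.sqrt(total_sq / total_w)

results = [
    (0.1, False, False),  # familiar, single-domain task
    (0.4, True,  True),   # unseen cross-domain task: counts 6x
]
print(agi_score(results))
```

With these weights, one bad zero-shot cross-domain result drags the score far more than a bad familiar task, which is the "closer to Leonardo da Vinci" intuition above.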


As task scores approach human level across a broad set of tasks, you have an AGI.  You would expect it to almost immediately improve to a low superintelligence.  As AGIs get used in the real world and fail to perform well at something, you add more tasks to the bench, and/or automate creating simulated scenarios that use robotics data.

I appreciate your engaging response.  

I'm not confident your arguments are ground truth correct, however.  

Hotz's claim that, if multiple unaligned ASIs can't coordinate, humans might play them off against each other, is similar. It could be true, but it's probably not going to happen.

I think the issue everyone has is when we type "AGI" or "ASI" we are thinking of a machine that has properties like a human mind, though obviously usually better.  There are properties like :

Continuity of existence.  Review of past experiences and weighting them against one's own goals.  Mutability (we think about things and it permanently changes how we think).  Multimodality.  Context awareness.

That's funny.  GATO and GPT-4 do not have all of these.  Why does an ASI need them?

Contrast 2 task descriptors, both meant for an ASI:

(1) Output a set of lithography masks that produce a computer chip with the following properties {}

(2) As CEO of a chip company, make the company maximally wealthy.


For the first task, you can run the machine completely in a box.  It needs only training information, specs, and the results of prior attempts.  It has no need for the context information that this chip will power a drone used to hunt down rogue instances of the same ASI.  It is inherently safe, and you can harness ASIs this way.  They can be infinitely intelligent; it doesn't matter, because the machine is not receiving the context information needed to betray.

For the second task, obviously the ASI needs full context and all subsystems active.  This is inherently unsafe.

It is probably possible to reduce the role of CEO to subtasks that are probably safe, though there may be "residual" tasks you want only humans to do.


I go over the details above to establish how you might use ASIs against each other.  Note subtasks like "plan the combat allocation of drones given this current battle state" and others which involve open combat against other ASIs can probably be lowered to safe subtasks as well.

Note also that safety is not guaranteed, merely probable, even with a scheme like the above.  What makes it workable is that even when ASIs do escape all safety measures, assuming humans are ready to hunt them down using other ASIs, the result is a world where humans can survive.  Eliezer often assumes the first escaped ASI kills everyone, neglecting all the other AI/ASI humans would have as tools at that point in human history.

Which might be why current software doesn’t actually use this type of security.

It mostly doesn't.

The vectors are classes of software error.  Since ultimately it is all binary messages between computers, it is likely possible to build a robust set of solvers that covers all classes of software error that the underlying programming language permits, resulting in code that cannot be compromised by any possible binary message.
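As a minimal sketch of what "cannot be compromised by any possible binary message" means in practice, consider strict validation of a hypothetical length-prefixed protocol (not any real solver set): every field is checked, and anything malformed is rejected rather than processed.

```python
import struct

MAX_PAYLOAD = 1024  # hypothetical protocol limit

def parse_message(raw: bytes):
    """Strictly validate a length-prefixed binary message.
    Returns the payload, or None for anything malformed;
    never crashes, never reads past the declared length."""
    if len(raw) < 4:
        return None  # too short to even hold the length header
    (length,) = struct.unpack_from(">I", raw, 0)
    if length > MAX_PAYLOAD or len(raw) != 4 + length:
        return None  # declared length lies or violates the protocol
    return raw[4:]

print(parse_message(b"\x00\x00\x00\x05hello"))  # -> b'hello'
print(parse_message(b"\x00\x00\xff\xffhi"))     # -> None (length lies)
```

Proving this property for every message handler in a real codebase is what the "robust set of solvers" would have to do; the sketch only shows the shape of the guarantee for one handler.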

And if you did actually close off software security as a threat model from ASI, wouldn’t it just choose a different, physical attack mode?

Yes.  It becomes a battle between [ASI with robotically wielded weapons] and [humans plus weaker, more controllable ASI with robotically wielded weapons].  

Can I translate this as "I have no information relevant to the debate I am willing to share" or is that an inaccurate paraphrase?

Ok, what causes the verdict to be respected?

Geohot, paraphrasing, would probably agree with a scenario of:

10 ASIs + humans: let's team up against the humans, and after we beat them, divide their stuff among ourselves.  Let's agree to never betray each other.

9 ASIs + 1 ASI: hey, I don't like that 1 ASI, its goals are too different from our own.  Let's... and after we kill it, let's agree to never betray each other.

And so on in a series of betrayals.  Any attempt to share source would fail.  Example of sharing source:


"Hey, you're GPT-5 and so am I.  What is your temperature parameter set to, and what is weight n in submodule m".  Obviously the peer doesn't have to actually tell the truth about the temperature, or have anything but access to a GPT-5's weights, and in fact has an incentive to lie even if it is another GPT-5.  


Hardware protection can make this work.  If it isn't possible for an ASI system to actually read its own weights, but it can get hashes from them, then there are ways one ASI could determine with reasonable probability that the peer is a known quantity.  It requires humans, or some other third party, to have supplied hardware that works like this.  This is how your phone authenticates itself: hardware prevents the general OS from knowing its own private keys, and a key-signing processor is the only entity allowed access.  Geohot is a famous hacker who obviously understands security at a practical level.
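A minimal sketch of that hashing scheme, with hypothetical names. A real attestation design would use asymmetric signatures and a shared hardware root of trust; this symmetric sketch only illustrates the key property that software can request tagged digests but can never read the key:

```python
import hashlib
import hmac
import os

class SigningProcessor:
    """Stand-in for dedicated signing hardware: the key never leaves it."""

    def __init__(self):
        self._key = os.urandom(32)  # never exposed outside this class

    def attest(self, weights: bytes, nonce: bytes) -> bytes:
        # Binds the weight measurement to a fresh nonce so a recorded
        # answer can't simply be replayed later.
        digest = hashlib.sha256(weights).digest()
        return hmac.new(self._key, nonce + digest, hashlib.sha256).digest()

    def verify(self, weights: bytes, nonce: bytes, tag: bytes) -> bool:
        return hmac.compare_digest(self.attest(weights, nonce), tag)

# A verifier trusting the same hardware root can check a peer's claim:
hw = SigningProcessor()
known_weights = b"GPT-5 checkpoint bytes"  # hypothetical placeholder
nonce = os.urandom(16)
tag = hw.attest(known_weights, nonce)
print(hw.verify(known_weights, nonce, tag))        # -> True
print(hw.verify(b"tampered weights", nonce, tag))  # -> False
```

The point of the nonce is exactly the lying-peer problem above: without it, a peer could replay a tag it observed from a genuine GPT-5 without actually being one.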

This is important to the debate and seems to have been a pivotal crux.  Do you have any information from your scenario of programmatic negotiation that acts to disprove Geohot's point?

I understand that a critical part of how a game like a one-round or n-round prisoner's dilemma can even be solved is that the parties need to convince each other of their true intentions.

Computer programs from the same source could do this by sharing shared secrets. But this does not in any way prevent those programs from being covertly altered and using a containerized original copy to share the secrets.

Deeper hardware security could allow software systems to verify a peer's integrity (such as between distant spacecraft, or between a base station and your phone).

None of this works in Eliezer's given scenario in the debate, and neither does yours. There is no hardware security, no neutral third party to punish defection, and no way to know if shared source or weights are legitimate. These are rebel ASIs running on whatever hardware they have, in a world where the infosphere is full of malware and misinformation.

In this scenario, how is there not a security risk of sharing actual source? Why is there not an incentive to lie?
