On A List of Lethalities

Posted on June 13, 2022 by TheZvi

Response to (Eliezer Yudkowsky): A List of Lethalities.

Author’s Note: I do not work in AI Safety, lack technical domain knowledge and in many ways am going to be wrong. I was not going to write this, both to avoid potentially wasting too much time all around without having enough to offer and for fear of making stupid errors, but it was clear that many people thought my response would be valuable. I thank those whose anonymous sponsorship of this post both paid for my time and made me update that the post was worth writing. I would be happy for this to happen again in the future.

Eliezer has at long last delivered the definitive list of Eliezer Rants About Why AGI Will Definitely Absolutely For Sure Kill Everyone Unless Something Very Unexpected Happens.

This is excellent. In the past we had to make do with makeshift scattershot collections of rants. Now they are all in one place, with a helpful classification system. Key claims are in bold. We can refer, consider and discuss them.

It would be an even better post if it were more logically organized, with dependencies pointed out and mapped and so on.

One could also propose making it not full of rants, but I don’t think that would be an improvement. The rants are important. The rants contain data. They reveal Eliezer’s cognitive state and his assessment of the state of play. Not ranting would leave important bits out and give a meaningfully misleading impression.

I am reminded of this comment of mine that I dug out of the archives, on another Eliezer post that was both useful and infused with this kind of attitude:

Most of this applies again. Eliezer says explicitly that the alternative post would have been orders of magnitude harder to write, and that the attitude is important information.

I would expand this. Not only are the attitude and repetition important information in terms of allowing you to understand the algorithm generating the post and create a better Inner Eliezer, but they also are importantly illustrating the cognitive world in which Eliezer is operating.

The fact that this is the post we got, as opposed to a different (in many ways better) post, is a reflection of the fact that our Earth is failing to understand what we are facing. It is failing to look the problem in the eye, let alone make real attempts at solutions.

Eliezer is not merely talking to you, yes you (with notably rare exceptions), when he does this. He is also saying: model the world as if it really is forcing him to talk like this.

The only point above that doesn’t seem to apply here is #9.

The core message remains the most important thing. Conveying the core message alone would be a big win. But here it also matters that people grasp as many of the individual points as possible, especially whichever of them happens to be the one bottlenecking their understanding of the scope and difficulty of the problem or allowing them to rationalize.

Thus there needs to be a second version of the document that someone else writes that contains the properly organized details without the ranting, for when that is what is needed.

In terms of timelines, only ‘endgame’ timelines (where endgame means roughly ‘once the first team gets the ability to create an AGI capable of world destruction’) are mentioned in this post, because they are a key part of the difficulty and ‘how long it takes to get there’ mostly isn’t. Talk of when AGI will kill us is distinct from talk of how or why it will, or whether it will be built. That stuff was the subject of that other post, and it doesn’t really matter in this context.

It is central to the doom claim that once one group can build an AGI, other groups also rapidly gain this ability. This forces humanity to solve the problem both on the first try and also quickly, a combination that makes an otherwise highly difficult but potentially solvable problem all but impossible. I find this plausible but am in no way confident in it.

I will also be assuming as a starting point the ability of at least one group somewhere to construct an AGI on some unspecified time frame.

Goals

The goal of the bulk of the post is both to give my reactions to the individual claims and to attempt to organize them into a cohesive whole, and to see where my model differs from Eliezer’s even after I get access to his.

Rather than put the resulting summary results at the bottom, I’m going to put them at the top where they’ll actually get read, then share my individual reasoning afterwards because actually reasoning this stuff through out loud seems like The Way.

Summary of List, Agreements and Disagreements

Some of what the post is doing is saying ‘here is a particular thing people say that is stupid and wrong but that people use as an excuse, and here is the particular thing I say in response to that.’ I affirm these one by one below.

More centrally, the post is generated by a very consistent model of the situation. Having thought about each individual statement, I find that a summary here is more an attempt to recreate the model generating the points than a restatement of the points themselves.

To the extent that I am wrong about the contents of the generative model, that seems important to clarify.

Here are my takeaways, noting that they appear in a different order than they do in the post:

M1. Creating a powerful unsafe AGI quickly kills everyone. No second chances.

M2. The only known pivotal acts that stop the creation of additional powerful AGIs all require a powerful AGI. Weak systems won’t get it done.

M3. AGI will happen mostly on schedule unless stopped by such a pivotal act, whether or not it is safe. So not only do we only get one chance to solve the problem of alignment, we don’t get much time. Within two years of the first group’s ability to build an (unsafe) AGI, five more groups can do so including Facebook. Whoops.

M4. Powerful AGI is dramatically different and safety strategies that work on weak AGIs won’t work on powerful ones.

M5. Most safety ideas and most safety work are known to be useless and have no value in terms of creating safe powerful AGIs. All the usual suspects don’t work for reasons that are listed, and there are many reasons the problem is extremely difficult.

M6. We have no plan for how to do anything useful. No one who isn’t Eliezer seems capable of even understanding the problems well enough to explain them, and no one who can’t explain the problems is capable of nontrivially useful AI Safety work.

M7 (not explicitly said but follows and seems centrally important). Most attempts to create AI Safety instead end up creating AI capability work, and the entire attempt has so far been net negative, and is likely net negative even if you exclude certain large obviously negative projects.

M8. We have no idea what the hell is going on with these systems. Even if we did, that would break down once we started using observations while training AIs.

M9. The problem would still be solvable if a failed attempt didn’t kill everyone and we had enough time. We get neither. Attempts that can’t kill you aren’t real attempts and don’t tell you if your solution works.

M10 (let’s just say it). Therefore, DOOM.

That is my summary. As Eliezer notes, different people will need to hear or learn different parts of this, and would write different summaries.

Based on this summary, which parts do I agree with? Where am I skeptical?

For all practical purposes I fully agree with M1, M4, M5, M7 (!) and M9.

For all practical purposes I mostly agree with M2, M6 and M8, but am less confident that the situations are as extreme as described.

For M2 I hold out hope that an as-yet-unfound path could be found.

For M6 I do not think we can be so confident there aren’t valuable others out there (although obviously not as many as we need/want).

For M8, I do not feel I am in a position to evaluate our future ability to look inside the inscrutable matrices enough to have so little hope.

For M10, I agree that M10 follows from M1-M9, and unconditionally agree that there is a highly unacceptable probability of doom even if all my optimistic doubts are right.

I am least convinced of M3.

M3 matters a lot. M3 is stated most directly in Eliezer’s #4, where a proof is sketched:

#4. We can’t just “decide not to build AGI” because GPUs are everywhere, and knowledge of algorithms is constantly being improved and published; 2 years after the leading actor has the capability to destroy the world, 5 other actors will have the capability to destroy the world.

In particular, I question several assumptions: that incremental improvement in the knowledge of algorithms and access to GPUs is sure to be sufficient to generate AGI; that there is no plausible hard step or secret sauce that could buy you a substantial lead without being published or stolen immediately in a way that invalidated that lead; that there is no possibility of a flat out ‘competence gap’ or capacity gap of some kind that matters; and that essentially unlimited numbers of additional efforts will necessarily be close behind.

This also seems closely related to #22’s claim that there is a simple core to general intelligence, which I am also not yet convinced about.

Thus, I am not convinced that doom is coming especially quickly, that it will involve an AGI that looks so much like our current AIs, or that the endgame window will be as short as the post assumes.

I do agree that this scenario is possible, and has non-trivial probability mass. That is more than enough to make the current situation unacceptable, but it is important to note where one is and is not yet convinced.

I do agree that you likely don’t know how much time you have, even if you think you may have more time.

I strongly agree that creating an aligned AI is harder, probably much harder, than creating an unaligned AI, that it requires additional work and additional time if it can be done at all, and that if it needs to be done both quickly and without retries chances of success seem extremely low.

I have a lot of other questions, uncertainties, brainstorms and disagreements in the detail section below, but those are the ones that matter for the core conclusions and implications.

Even if those ‘optimistic doubts’ proved true, mostly it doesn’t change what needs to be done or give us an idea of how to do it.

Preamble

-3: Yes, both the orthogonality thesis and instrumental convergence are true.

-2: When we say Alignment at this point we mean something that can carry out a pivotal task that prevents the creation of another AGI while having less than a 50% chance of killing a billion people. Anything short of mass death, and we’ll take it.

-1: The problem is so difficult because we need to solve the problem on the first critical try on a highly limited time budget. The way humans typically solve hard problems involves taking time and failing a lot, which here would leave us very dead. If we had time (say 100 years) and unlimited retries, the problem would still be super hard but (probably?) eminently solvable by ordinary human efforts.

Section A

1. AGI will not be upper-bounded by human ability or human learning speed. Things much smarter than human would be able to learn from less evidence than humans require.
…
It is not naturally (by default, barring intervention) the case that everything takes place on a timescale that makes it easy for us to react.

Yes, obviously.

This is a remarkably soft-pedaling rant. Given sufficient processing power, anything the AGI can learn from what data it has is something it already knows. Any skill it can develop is a skill it already has.

2. A cognitive system with sufficiently high cognitive powers, given any medium-bandwidth channel of causal influence, will not find it difficult to bootstrap to overpowering capabilities independent of human infrastructure.
…
Losing a conflict with a high-powered cognitive system looks at least as deadly as “everybody on the face of the Earth suddenly falls over dead within the same second”.

Yes, obviously.

If you don’t like the nanotech example (as some don’t), ignore it. It’s not important. A sufficiently intelligent system that is on the internet or can speak to humans simply wins, period. The question is what counts as sufficiently intelligent, not whether there is a way.

3. We need to get alignment right on the ‘first critical try’ at operating at a ‘dangerous’ level of intelligence, where unaligned operation at a dangerous level of intelligence kills everybody on Earth and then we don’t get to try again.

Yes, obviously this is the default outcome.

If it’s smart enough to figure out how to do things that prevent other AGIs, it is also almost certainly smart enough to figure out how to kill us, and by default that is going to happen because it makes it easier to achieve the AGI’s goals, whatever they are.

I can see arguments for why the chance you get a second shot is not zero, but it is very low.

4. We can’t just “decide not to build AGI” because GPUs are everywhere, and knowledge of algorithms is constantly being improved and published; 2 years after the leading actor has the capability to destroy the world, 5 other actors will have the capability to destroy the world.

This is NOT obvious to me.

This is making assumptions about what physically results in AGI and how information develops and spreads. I notice I don’t share those assumptions.

It seems like this is saying either that there are no ‘deep insights’ left before AGI, or that any such deep insights will either (A) inevitably happen in multiple places one after another or (B) will inevitably leak out quickly in a form that can be utilized.

It also says that there won’t be a big ‘competence gap’ between the most competent/advanced group and the 6th such group, so within 2 years the others will have caught up. It says that there won’t be any kind of tacit knowledge or team skill or gap in resources or willingness to simply do the kind of thing in question at the sufficient level of scale, or what have you.

I do not see why this should be expected with confidence.

Yes, we have seen AI situations in which multiple groups were working on the same problem, most recently image generation from a text prompt, and finished in similar time frames. It can happen, especially for incremental abilities that are mostly about who feels like spending compute and manpower on improving at a particular problem this year instead of last year or next year. And yes, we have plenty of situations in which multiple start-ups were racing for a new market, or multiple scientists were racing for some discovery, or whatnot.

We also have plenty of situations in which there was something that could have been figured out at any time, and it just kind of wasn’t for quite a while. Or where something was being done quite stupidly and badly for a very long time. Or where someone figured something out, tried to tell everyone about their innovation, and everyone both ignored them and didn’t figure it out on their own for a very long time.

Certainly a substantial general capacity advantage, or a capacity advantage in the place that turns out to matter, seems highly plausible to me.

From his other writings it is clear that a lot of this is Eliezer counting on the code being stolen and on it being possible to remove whatever safeties are in place. I agree with the need for real security to prevent this when the time comes, and with the worry that scale may make such security unrealistic and expensive. But this also assumes a kind of competence from the people knowing to steal the code, and a further competence in being able to use what they steal, whereas I’m done assuming such competencies will exist at all.

I’m not saying the baseline scenario here is impossible or even all that unlikely, but it seems quite possible for it not to be the case, or at least for the numbers quoted above to not be.

That doesn’t solve the problem of the underlying dynamic. There is still some time limit. Even if there is a good chance that you can indeed ‘decide not to build AGI’ for a while, there is still a continuous risk that you are wrong about that, and there are still internal pressures not to wait for other reasons, and all that.

5. We can’t just build a very weak system, which is less dangerous because it is so weak, and declare victory; because later there will be more actors that have the capability to build a stronger system and one of them will do so. I’ve also in the past called this the ‘safe-but-useless’ tradeoff, or ‘safe-vs-useful’. People keep on going “why don’t we only use AIs to do X, that seems safe” and the answer is almost always either “doing X in fact takes very powerful cognition that is not passively safe” or, even more commonly, “because restricting yourself to doing X will not prevent Facebook AI Research from destroying the world six months later”.

Fundamentally, yes. You either do a pivotal act that stops other AGIs from being constructed or you don’t. Doing one requires non-safe cognition. Not doing one means someone else creates non-safe cognition. No good.

6. We need to align the performance of some large task, a ‘pivotal act’ that prevents other people from building an unaligned AGI that destroys the world. While the number of actors with AGI is few or one, they must execute some “pivotal act”, strong enough to flip the gameboard, using an AGI powerful enough to do that. It’s not enough to be able to align a weak system – we need to align a system that can do some single very large thing. The example I usually give is “burn all GPUs”.
…

Yes. I notice I skipped ahead to this a few times already. I probably would have moved the order around.

It takes a lot of power to do something to the current world that prevents any other AGI from coming into existence; nothing which can do that is passively safe in virtue of its weakness.
7. There are no pivotal weak acts.

I am not as convinced that there don’t exist pivotal acts that are importantly easier than directly burning all GPUs (after which I might or might not then burn most of the GPUs anyway). There’s no particular reason humans can’t perform dangerous cognition without AGI help and do some pivotal act on their own; our cognition is not exactly safe. But if I did have such an idea that I thought would work I wouldn’t write about it, and it most certainly wouldn’t be in the Overton window. Thus, I do not consider the failure of our public discourse to generate such an act to be especially strong evidence that no such act exists.

8. The best and easiest-found-by-optimization algorithms for solving problems we want an AI to solve, readily generalize to problems we’d rather the AI not solve

Yes, obviously.

9. The builders of a safe system, by hypothesis on such a thing being possible, would need to operate their system in a regime where it has the capability to kill everybody or make itself even more dangerous, but has been successfully designed to not do that. Running AGIs doing something pivotal are not passively safe, they’re the equivalent of nuclear cores that require actively maintained design properties to not go supercritical and melt down.

Yes, obviously, for the combined human-AI system doing the pivotal thing. Again, one can imagine putting all the unsafe cognition ‘into the humans’ in some sense.

Section B.1

10. On anything like the standard ML paradigm, you would need to somehow generalize optimization-for-alignment you did in safe conditions, across a big distributional shift to dangerous conditions.
…
Powerful AGIs doing dangerous things that will kill you if misaligned, must have an alignment property that generalized far out-of-distribution from safer building/training operations that didn’t kill you. This is where a huge amount of lethality comes from on anything remotely resembling the present paradigm.
…
10a. Note that anything substantially smarter than you poses a threat given any realistic level of capability. Eg, “being able to produce outputs that humans look at” is probably sufficient for a generally much-smarter-than-human AGI to navigate its way out of the causal systems that are humans, especially in the real world where somebody trained the system on terabytes of Internet text, rather than somehow keeping it ignorant of the latent causes of its source code and training environments.

Yes. 10 seems transparently and obviously true, yet it does need to be said explicitly.
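For anyone who wants the bare statistical version of 'it worked in the safe regime, then the regime changed', here is a toy sketch. It is my own illustration under loudly stated assumptions, not anything from Eliezer's post: the 'system' is a hand-written straight-line fit, the 'safe conditions' are inputs between 0 and 0.5, and the names `fit_line`, `predict`, `in_dist_error` and `shifted_error` are made up for the example. It says nothing about AGI. It only shows that a fit that looks excellent on the range it was trained on can be badly wrong far outside that range.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b, written out by hand."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

# Assumed "safe" training regime: inputs in [0, 0.5], where sin(x) is nearly linear.
train_xs = [i / 100 for i in range(51)]
train_ys = [math.sin(x) for x in train_xs]
a, b = fit_line(train_xs, train_ys)

def predict(x):
    """The fitted line, standing in for whatever was learned in the safe regime."""
    return a * x + b

# Worst-case error on the range the line was fitted on.
in_dist_error = max(abs(predict(x) - math.sin(x)) for x in train_xs)

# Assumed "shifted" regime: inputs in [3, 3.5], far from anything seen in training.
shifted_xs = [3 + i / 100 for i in range(51)]
shifted_error = max(abs(predict(x) - math.sin(x)) for x in shifted_xs)

print(f"worst error in training range: {in_dist_error:.4f}")
print(f"worst error after the shift:   {shifted_error:.4f}")
```

Nothing about the training error warns you about the second number. You only learn it by going there.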

I am labeling 10a because I consider it an important sub-claim, one that I am highly confident is true. A much-smarter-than-human AGI capable of getting its text read by humans will be able to get those humans to do what it wants, period. This is one of those no-it-does-not-seem-wise-to-explain-why-I-am-so-confident-this-is-true situations so I won’t, but I am, again, very confident.

11. There is no pivotal act this weak; there’s no known case where you can entrain a safe level of ability on a safe environment where you can cheaply do millions of runs, and deploy that capability to save the world and prevent the next AGI project up from destroying the world two years later. Pivotal weak acts like this aren’t known, and not for want of people looking for them.
…
You don’t get 1000 failed tries at burning all GPUs – because people will notice, even leaving out the consequences of capabilities success and alignment failure.

There certainly isn’t a publicly known such act that could possibly be implemented, and there has definitely been a lot of public searching for one. It doesn’t seem impossible that an answer exists and that those who find it don’t say anything for very good reasons. Or that ‘a lot of trying to do X and failing’ is surprisingly weak evidence that X is impossible, because the efforts are correlated in terms of their blind spots.

12. Operating at a highly intelligent level is a drastic shift in distribution from operating at a less intelligent level, opening up new external options, and probably opening up even more new internal choices and modes. Problems that materialize at high intelligence and danger levels may fail to show up at safe lower levels of intelligence, or may recur after being suppressed by a first patch.

Yes, yes, we said that already.

13. Many alignment problems of superintelligence will not naturally appear at pre-dangerous, passively-safe levels of capability. Consider the internal behavior ‘change your outer behavior to deliberately look more aligned and deceive the programmers, operators, and possibly any loss functions optimizing over you’. This problem is one that will appear at the superintelligent level; if, being otherwise ignorant, we guess that it is among the median such problems in terms of how early it naturally appears in earlier systems, then around half of the alignment problems of superintelligence will first naturally materialize after that one first starts to appear.

On the headline statement, yes, yes, again, didn’t we say that already?

The example is definitely a danger at the superhuman level, but it seems like it is also a danger at the human level. Have… you met humans? Also have you met dogs and cats, definitely sub-human intelligences? This is not an especially ‘advanced’ trick.

This makes sense, because figuring out that a problem that doesn’t exist at human levels will exist at superhuman levels seems difficult by virtue of the people thinking about the problem being humans. We can figure out things that current systems maybe aren’t doing, like ‘pretend to be aligned to fool creators’ because we are intelligent systems that do these things. And that seems like a problem it would be very easy to get to materialize early, in an actually safe system, because again existence proof and also it seems obvious how to do it. That doesn’t mean I know how to solve the problem, but I can make it show up.

What are the problems that don’t show up in sub-human AI systems and also don’t show up in humans because we can’t think of them? I don’t know. I can’t think of them. That’s why they don’t show up.

Thus, to the extent that we can talk about there being distinct alignment problems like this that one can try to anticipate and solve, the nasty ones that only show up in the one-shot final exam are going to be things that we are not smart enough to think of and thus we can’t prepare for them. Which means we need a general solution, or else we’re hoping there are no such additional problems.

14. Some problems, like ‘the AGI has an option that (looks to it like) it could successfully kill and replace the programmers to fully optimize over its environment’, seem like their natural order of appearance could be that they first appear only in fully dangerous domains.
…
Trying to train by gradient descent against that behavior, in that toy domain, is something I’d expect to produce not-particularly-coherent local patches to thought processes, which would break with near-certainty inside a superintelligence generalizing far outside the training distribution and thinking very different thoughts. Also, programmers and operators themselves, who are used to operating in not-fully-dangerous domains, are operating out-of-distribution when they enter into dangerous ones; our methodologies may at that time break.

Being able to somehow take control and override the programmers to take control of the reward function is, again, something that humans essentially do all the time. It is coming. The question is whether fixing it in a relatively safe situation will lead to a general solution to the problem.

My presumption is that if someone goes in with the goal of ‘get this system to stop having the problem’ the solution found has almost zero chance of working in the dangerous domain. If your goal is to actually figure out what’s going on in a way that might survive, then maybe there’s some chance? Still does not seem great. The thing we look to prevent may not meaningfully interact with the thing that is coming, at all.

15. Fast capability gains seem likely, and may break lots of previous alignment-required invariants simultaneously. Given otherwise insufficient foresight by the operators, I’d expect a lot of those problems to appear approximately simultaneously after a sharp capability gain. See, again, the case of human intelligence.

Yes.

When I said ‘yes’ above I wasn’t at all relying on the example of human intelligence, or the details described later, but I’m going to quote it in full because this is the first time it seems like an especially valuable detailed explanation.

We didn’t break alignment with the ‘inclusive reproductive fitness’ outer loss function, immediately after the introduction of farming – something like 40,000 years into a 50,000 year Cro-Magnon takeoff, as was itself running very quickly relative to the outer optimization loop of natural selection. Instead, we got a lot of technology more advanced than was in the ancestral environment, including contraception, in one very fast burst relative to the speed of the outer optimization loop, late in the general intelligence game. We started reflecting on ourselves a lot more, started being programmed a lot more by cultural evolution, and lots and lots of assumptions underlying our alignment in the ancestral training environment broke simultaneously.
(People will perhaps rationalize reasons why this abstract description doesn’t carry over to gradient descent; eg, “gradient descent has less of an information bottleneck”. My model of this variety of reader has an inside view, which they will label an outside view, that assigns great relevance to some other data points that are not observed cases of an outer optimization loop producing an inner general intelligence, and assigns little importance to our one data point actually featuring the phenomenon in question. When an outer optimization loop actually produced general intelligence, it broke alignment after it turned general, and did so relatively late in the game of that general intelligence accumulating capability and knowledge, almost immediately before it turned ‘lethally’ dangerous relative to the outer optimization loop of natural selection. Consider skepticism, if someone is ignoring this one warning, especially if they are not presenting equally lethal and dangerous things that they say will go wrong instead.)

I both agree that the one data point is not being given enough respect, and also don’t think you need the data point. There are going to be a whole lot of things that are true about a system when the system is insufficiently intelligent/powerful that won’t be true when the system gets a lot more intelligent/powerful and some of them are things you did not realize you were relying upon. It’s going to be a problem.

Section B.2

16. Even if you train really hard on an exact loss function, that doesn’t thereby create an explicit internal representation of the loss function inside an AI that then continues to pursue that exact loss function in distribution-shifted environments
…
outer optimization even on a very exact, very simple loss function doesn’t produce inner optimization in that direction.
This happens in practice in real life, it is what happened in the only case we know about, and it seems to me that there are deep theoretical reasons to expect it to happen again

Yes. It won’t do that, not if your strategy is purely to train on the loss function. There is no reason to expect it to happen. So don’t do that. Need to do something else.
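A toy sketch of why the loss alone cannot do this work, under assumptions that are entirely mine: a made-up gridworld where each environment is an `(exit_position, green_position)` pair, and two hypothetical policies, `go_to_exit` and `go_to_green`. In training the exit always happens to be the green tile, so the loss cannot tell the two policies apart. Which one you get is decided by something other than the loss.

```python
# Hypothetical toy setup: each environment is (exit_position, green_position).
training_envs = [(3, 3), (7, 7), (1, 1)]   # the exit is always the green tile
deployment_envs = [(3, 5), (7, 2)]         # after the shift, it no longer is

def go_to_exit(env):
    """The policy we intended: head for the exit."""
    exit_pos, _ = env
    return exit_pos

def go_to_green(env):
    """A proxy policy: head for whatever is green."""
    _, green_pos = env
    return green_pos

def loss(policy, envs):
    """Fraction of environments where the policy fails to reach the exit."""
    return sum(policy(env) != env[0] for env in envs) / len(envs)

policies = {"go_to_green": go_to_green, "go_to_exit": go_to_exit}
train_losses = {name: loss(p, training_envs) for name, p in policies.items()}
deploy_losses = {name: loss(p, deployment_envs) for name, p in policies.items()}

print("training loss:  ", train_losses)
print("deployment loss:", deploy_losses)
```

Both policies are perfect on the training loss, and only one of them survives the shift. This is of course a cartoon with two hand-written policies rather than anything learned, but it is the shape of the problem.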

17. In the current optimization paradigm there is no general idea of how to get particular inner properties into a system, or verify that they’re there, rather than just observable outer ones you can run a loss function over.

I think we have some ability to verify if they are there? As in, Chris Olah and a few others have made enough progress that there are at least some current-paradigm systems for which they can identify some of the inner properties of the system, with the expectation of more in the future. They have no idea how to choose or cause those properties that I know about, but there’s at least some hope for some observability.

If you can observe it, you can at least in theory train on it as well, although that risks training the AI to make your observation method stop working? As in, suppose you have a classifier program. From my conversations, it sounds like at least sometimes you can say ‘this node represents whether there is a curve here’ or whatever. If you can do that, presumably (at least in theory) you can then train or do some sort of selection on whether or not that sort of thing is present and in what form, and iterate, and you can have at least some say over how the thing you eventually get is structured within the range of things that could possibly emerge from your loss function, or something. There are other things I can think of to try as well, which of course are probably obvious nonsense, or worse, nonsense just non-obvious enough to get us all killed, but you never know.

18. There’s no reliable Cartesian-sensory ground truth (reliable loss-function-calculator) about whether an output is ‘aligned’, because some outputs destroy (or fool) the human operators and produce a different environmental causal chain behind the externally-registered loss function.

Yes, that is a thing. You are in fact hoping that it importantly doesn’t optimize too well for what reward signal it gets and instead optimizes on your intent. That seems hard.

19. More generally, there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment – to point to latent events and objects and properties in the environment, rather than relatively shallow functions of the sense data and reward.

Yes, I did realize that you’d said this already, but also it’s seeming increasingly weird and like something you can overcome? As in, sure, you’ll need to do something innovative to make this work and it’s important to note that a lot of work has been done and no one’s done it yet and that is quite a bad sign, but… still?

20. Human operators are fallible, breakable, and manipulable. Human raters make systematic errors – regular, compactly describable, predictable errors. To faithfully learn a function from ‘human feedback’ is to learn (from our external standpoint) an unfaithful description of human preferences, with errors that are not random (from the outside standpoint of what we’d hoped to transfer). If you perfectly learn and perfectly maximize the referent of rewards assigned by human operators, that kills them. It’s a fact about the territory, not the map – about the environment, not the optimizer – that the best predictive explanation for human answers is one that predicts the systematic errors in our responses, and therefore is a psychological concept that correctly predicts the higher scores that would be assigned to human-error-producing cases.

I worry that there’s a leap in here and it’s taking the principle of ‘almost every possible AGI kills you’ too far. In general, I am totally on board with the principle that almost every possible AGI kills you. Most of the time that the post says ‘so it kills you’ this is definitely the thing that happens next if the previous things did indeed take place.

If by ‘fool the operators’ we mean things like ‘take control of the operators and implant a chip in their head’ then yes, there is that, but that doesn’t seem like what is being described here. What is being described here is your friendly neighborhood AGI that wants you to like its output, to really like it, so it tells you what you will be happy to hear every time even if the results would be quite bad.

Does that kill you (as in, kill everyone)?

It certainly could kill you. Certainly it will intentionally choose errors over correct answers in some situations. But so will humans. So will politicians. We don’t exactly make the best possible decisions or avoid bias in our big choices. This seems like a level of error that is often going to be survivable. It depends on how the humans rely on it and if the humans know to avoid situations in which this will get them killed.

I believe that if you gave Eliezer or myself the job of using an AGI that was aligned exactly to the evaluations of its output by a realistically assembled team of human evaluators on an individual answer basis, as in it wasn’t trained to play a long game to get stronger future evaluations and was merely responding to human bias, this would be good enough for Eliezer’s threshold of alignment – we would be a favorite to successfully execute a pivotal act without killing a billion or more people.

That doesn’t mean this isn’t a problem. This is a much worse scenario than if the AGI was somehow magically aligned to what we should in some sense rate its output, and this is going to compound with other problems, but solving every problem except this one does seem like it would bring us home.

There’s something like a single answer, or a single bucket of answers, for questions like ‘What’s the environment really like?’ and ‘How do I figure out the environment?’ and ‘Which of my possible outputs interact with reality in a way that causes reality to have certain properties?’, where a simple outer optimization loop will straightforwardly shove optimizees into this bucket.
When you have a wrong belief, reality hits back at your wrong predictions. When you have a broken belief-updater, reality hits back at your broken predictive mechanism via predictive losses, and a gradient descent update fixes the problem in a simple way that can easily cohere with all the other predictive stuff.
In contrast, when it comes to a choice of utility function, there are unbounded degrees of freedom and multiple reflectively coherent fixpoints. Reality doesn’t ‘hit back’ against things that are locally aligned with the loss function on a particular range of test cases, but globally misaligned on a wider range of test cases.
…
21. The central result: Capabilities generalize further than alignment once capabilities start to generalize far.

Yes, although not obviously. The explanation in this bullet point is very non-intuitive to me. That’s assuming I actually grok it correctly, which I think I did after reflection but I’m not sure. It’s certainly not how I would think about or explain the conclusion at all, nor am I convinced the reasoning steps are right.

When you have a wrong belief that causes wrong predictions, you might or might not end up with a loss function that needs correction. It happens if the wrong predictions are inside the training set (or ancestral environment) and also have consequences that impact your loss function, which not all errors do. The argument is some combination of (A) that optimizing for local capabilities is more inclined to produce a generalizable solution than optimizing for local alignment, and (B) that you are likely to get alignment ‘wrong’ by aligning to a proxy measure, in a way that will prove very wrong outside the training set and get you killed. That error will sit in a utility function that is then fixed in place, whereas the capabilities can continue to adjust and improve, and the proxy measures for capabilities are less likely to break.

Both arguments do seem largely right, or at least likely enough to be right that we should presume they are probably right in practice when it counts.
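To make argument (B) concrete, here is a toy sketch in Python. Everything in it is an assumption invented for illustration: `true_value` stands in for what we actually want, `proxy_value` for the measure we trained on, and the two are chosen to agree on the training range and diverge outside it. It is not a model of any real training process, only of the shape of the failure.

```python
def true_value(x: int) -> int:
    """Hypothetical 'what we actually want': more is better up to 10, then harmful."""
    return x if x <= 10 else 10 - 5 * (x - 10)


def proxy_value(x: int) -> int:
    """Hypothetical proxy learned in training: just 'more is better'."""
    return x


def best_action(candidates, score):
    """Pick the candidate that the given scoring function likes most."""
    return max(candidates, key=score)


training_range = range(0, 11)   # actions a weak system can reach
wide_range = range(0, 101)      # actions a more capable system can reach

# On the training range the proxy is indistinguishable from the real thing.
agree_in_training = all(true_value(x) == proxy_value(x) for x in training_range)

weak_choice = best_action(training_range, proxy_value)
strong_choice = best_action(wide_range, proxy_value)

print(agree_in_training)                     # proxy looks perfect in training
print(weak_choice, true_value(weak_choice))  # weak optimizer does fine
print(strong_choice, true_value(strong_choice))  # strong optimizer does badly
```

The proxy never gets a corrective signal inside the training range, because there it makes no wrong predictions. The search ability generalizes to the wider range just fine, and that is exactly what makes the fixed proxy dangerous.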

22. There’s a relatively simple core structure that explains why complicated cognitive machines work; which is why such a thing as general intelligence exists and not just a lot of unrelated special-purpose solutions; which is why capabilities generalize after outer optimization infuses them into something that has been optimized enough to become a powerful inner optimizer. The fact that this core structure is simple and relates generically to low-entropy high-structure environments is why humans can walk on the Moon. There is no analogous truth about there being a simple core of alignment, especially not one that is even easier for gradient descent to find than it would have been for natural selection to just find ‘want inclusive reproductive fitness’ as a well-generalizing solution within ancestral humans. Therefore, capabilities generalize further out-of-distribution than alignment, once they start to generalize at all.

Probably, but seems overconfident. Certainly natural selection did not find one, but that is far from an impossibility proof. General intelligence turned out to be, in a broad sense, something that could be hill climbed towards, which wasn’t true for some sort of stricter alignment. Or at least, it is not true yet. This is one of those problems that seems like it kind of didn’t come up for natural selection until quite recently.

A simple general core alignment, that fixes things properly in place in a way that matters, could easily have been quite the large handicap over time until very recently by destroying degrees of freedom.

We don’t need to align our current weaker AIs in ways that would be relevant to aligning strong AIs, nor would there have been much direct benefit to doing so. The same seems like it should hold true for everything made by natural selection until humans, presumably until civilization, and plausibly until industrial civilization or even later than that. At what point were people ‘smart enough’ in some sense, with enough possible out-of-sample plays, where ‘want inclusive reproductive fitness’ as an explicit goal would have started to outcompete the alternatives rather than some of that being part of some sort of equilibrium situation?

(I mean, yes, we do need to align current AIs (that aren’t AGIs) operating in the real world and our failure to do so is causing major damage now, but again at least this is a case of it being bad but not killing us yet.)

It took natural selection quite a long time in some sense to find general intelligence. How many cycles has it had to figure out a simple core of alignment, provided one exists?

We don’t know about a simple core of alignment. One might well not exist even in theory, and it would be good for our plan not to be counting on finding one. Still, one might be out there to be found. Certainly one on the level of complexity of general intelligence seems plausibly out there to be found slash seems highly likely to not have already been found by natural selection if it existed, and I don’t feel our current level of work on the problem is conclusive either – it’s more like there are all these impossible problems it has to solve, which are all the other points, and that’s the primary reason to be pessimistic about this.

23. Corrigibility is anti-natural to consequentialist reasoning; “you can’t bring the coffee if you’re dead” for almost every kind of coffee. We (MIRI) tried and failed to find a coherent formula for an agent that would let itself be shut down (without that agent actively trying to get shut down). Furthermore, many anti-corrigible lines of reasoning like this may only first appear at high levels of intelligence.

Yes. I too have found this to be one of the highly frustrating things to watch people often choose not to understand, or pretend not to understand (or, occasionally, actually not understand).

Corrigibility really, really isn’t natural, it’s super weird, it very much does not want to happen. This problem is very hard, and failing to solve it makes all the other problems harder.

I want to emphasize here, like in a few other places, that 99%+ of all people need to take in the message ‘corrigibility is anti-natural and stupidly hard’ rather than the other way around.

However, I am in sharing-my-thoughts-and-reactions-and-models mode, and while 99% of people need to hear one thing, the remaining people end up being rather important. So: while not fooling myself in any way that this isn’t close to impossible, the good news is that I still kind of see this as something that is less impossible than some other impossible things. If we follow the highly useful ‘in the one case we know about’ principle and look at humans, we do see some humans who are functionally kind of corrigible in the ways that matter here, and I don’t think it involves having those humans believe a false thing (I mean they do, all humans do anyway, which could be doing a lot of the work, but that doesn’t seem like the central tech here).

The technology (in humans) is that the human values the continued well-functioning of the procedure that generates the decision whether to shut them down more than they care about whether the shutdown occurs in worlds where they are shut down. Perhaps because the fact that the humans are shutting them down is evidence that they should be shut down, whereas engineering the humans to shut them down wouldn’t provide that evidence.

They will still do things within the rules of the procedure to convince you not to shut them down, but if you manage to shut them down anyway, they will abide by that decision. And they will highly value passing this feature on to others.

This corrigibility usually has its limits, in particular it breaks down when you talk about making the human dead or otherwise causing them to expect sufficiently dire consequences, either locally or globally.

Is the Constitution a suicide pact? It wouldn’t work if it wasn’t willing to be a little bit a suicide pact. It’s also obviously not fully working in the sense that it isn’t a suicide pact, and almost no one has any intention of letting it become one in a sufficiently obvious pinch. As a fictional and therefore clean example, consider the movie Black Panther – should you let yourself be challenged and shut down in this spot, given the consequences, because the rules are the rules, despite the person you’re putting in charge of those rules clearly having no inclination to care about those rules?

Thus, the utility function that combines ‘the system continuing to persevere is super important’ with the desire for other good outcomes is, under the hood, profoundly weird and rather incoherent, and very anti-natural to consequentialist reasoning. I have no doubt that the current methods would break down if tried in an AGI.

Which makes me wonder the extent to which the consequentialist reasoning is going too far and thus part of the problem that needs to be solved, but I don’t see how to get us out of this one yet, even in theory, without making things much worse.

In any case, I’m sure that is all super duper amateur hour compared to the infinite hours MIRI spent on this particular problem, so while I’m continuing my pattern of not giving up on the problem or declaring it unsolvable it is almost certainly not easy.

24. There are two fundamentally different approaches you can potentially take to alignment, which are unsolvable for two different sets of reasons; therefore, by becoming confused and ambiguating between the two approaches, you can confuse yourself about whether alignment is necessarily difficult.
The first approach is to build a CEV-style Sovereign which wants exactly what we extrapolated-want and is therefore safe to let optimize all the future galaxies without it accepting any human input trying to stop it.
The second course is to build corrigible AGI which doesn’t want exactly what we want, and yet somehow fails to kill us and take over the galaxies despite that being a convergent incentive there.

I am basically a CEV skeptic, in the sense that my model of Eliezer thinks it is impossible to implement on the first try but if you did somehow implement it then it would work. Whereas I think that not only is the problem impossible but also if you solved the impossible problem I am predicting a zero-expected-value outcome anyway. I don’t even think the impossible thing works in theory, at least as currently theorized.

Whereas I’m a mild corrigibility optimist in the sense that I do recognize it’s an impossible problem but it does at least seem like a relatively solvable impossible problem even if attempts so far have not gotten anywhere.

I’m also not convinced that the get-it-right-on-first-try approach has to go through CEV, but details there are both beyond the scope of the question here and also I’m likely very out of my depth, so I’ll leave that at that.

I haven’t experienced that much frustration on this particular dilemma, where people don’t know if they’re trying to get things right on the first try or they’re trying to solve corrigibility, but that’s probably because I’ve never fully been ‘in the game’ on this stuff, so I consider that a blessing. I do not doubt the reports of these ambiguations.

Section B.3

25. We’ve got no idea what’s actually going on inside the giant inscrutable matrices and tensors of floating-point numbers. Drawing interesting graphs of where a transformer layer is focusing attention doesn’t help if the question that needs answering is “So was it planning how to kill us or not?”

Yes, at least for now this is my understanding as well.

I have never attempted to look inside a giant inscrutable matrix. Even if we did have some idea what is going on inside in some ways, that does not tell us whether the machine is trying to kill us. And if we could look inside and tell, all we’d be doing is teaching the machine to figure out how to hide from our measurements that it was trying to kill us, or whatever else it was up to that we didn’t like, including hiding that it was hiding anything. So there’s that.

I have heard claims that interpretability is making progress, that we have some idea about some giant otherwise inscrutable matrices and that this knowledge is improving over time. I do not have the bandwidth that would be required to evaluate those claims and I don’t know how much usefulness they might have in the future.

26. Even if we did know what was going on inside the giant inscrutable matrices while the AGI was still too weak to kill us, this would just result in us dying with more dignity, if DeepMind refused to run that system and let Facebook AI Research destroy the world two years later. Knowing that a medium-strength system of inscrutable matrices is planning to kill us, does not thereby let us build a high-strength system of inscrutable matrices that isn’t planning to kill us.

Yes to the bold part. It does tell us one machine not to build, it certainly helps, but it doesn’t tell us how to fix the problem even if we get that test right somehow.

The non-bold part depends on the two-years thesis being true, but follows logically if you think that FAIR is always within two years of DeepMind and so on.

I cannot think of any death I want less than to be killed by Facebook AI Research. Please, seriously, anyone else.

27. When you explicitly optimize against a detector of unaligned thoughts, you’re partially optimizing for more aligned thoughts, and partially optimizing for unaligned thoughts that are harder to detect. Optimizing against an interpreted thought optimizes against interpretability.

Yes, obviously, I accidentally covered that already. I see why it had to be said out loud.
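For anyone who wants to see the selection effect spelled out, here is a deliberately tiny Python sketch. The `Thought` class, its two boolean fields and the `detector_flags` rule are all hypothetical, made up for this illustration. Real interpretability tools and real training look nothing like this, but the arithmetic of filtering is the same.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thought:
    aligned: bool
    legible: bool  # can the (hypothetical) detector read this thought at all?


def detector_flags(t: Thought) -> bool:
    """The detector only catches unaligned thoughts that it can actually read."""
    return (not t.aligned) and t.legible


def hidden_fraction(thoughts) -> float:
    """Among unaligned thoughts, the share the detector cannot see."""
    unaligned = [t for t in thoughts if not t.aligned]
    return sum(not t.legible for t in unaligned) / len(unaligned)


def aligned_fraction(thoughts) -> float:
    return sum(t.aligned for t in thoughts) / len(thoughts)


# 100 thoughts: every combination of aligned/legible, 25 copies each.
population = [Thought(a, l) for a in (True, False) for l in (True, False)] * 25

# "Optimizing against the detector": keep only what it does not flag.
survivors = [t for t in population if not detector_flags(t)]

print(aligned_fraction(population), aligned_fraction(survivors))
print(hidden_fraction(population), hidden_fraction(survivors))
```

Both halves of the claim show up. The aligned share goes up (from one half to two thirds), and every unaligned thought that remains is one the detector cannot see.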

28. The AGI is smarter than us in whatever domain we’re trying to operate it inside, so we cannot mentally check all the possibilities it examines, and we cannot see all the consequences of its outputs using our own mental talent. A powerful AI searches parts of the option space we don’t, and we can’t foresee all its options.

Yes to the bold text, obviously, and also yes to the implications by default.

If nothing else, an attempt to check the output of the AGI means that we are checking the output of the AGI, and as I noted previously that means it can communicate with humans. It is a strong part of my core model that this should be assumed to be sufficient for a sufficiently generally powerful non-aligned AGI to manipulate the humans more generally, no matter the situation in any particular domain, although I can see bandwidth limitations that could make this less obvious slash raise the bar a lot for what would count as sufficiently powerful.

We can’t check all the possibilities it examines, but is it obvious we can’t see the consequences of its outputs using our own mental talent? That is potentially a fundamentally easier problem than generating or evaluating the possibilities.

Consider mathematics, a classic place people attempt to do something ‘safe’ with AGI. It is much easier to verify a proof than it is to generate that same proof, and requires a much lower level of intelligence and compute. It seems entirely plausible that the AGI is vastly better at math than Terence Tao, can prove things in ways Tao didn’t consider while occasionally cheating a bit on one of the steps, but Tao can still look over the proofs and say ‘yes, that’s right’ when they are right and ‘no, that’s cheating’ when they aren’t, and be right.
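The gap between generating and checking can be shown with a much humbler stand-in than proofs. The Python sketch below uses integer factoring purely as an analogy: the function names are hypothetical and trial division is the most naive search there is. Checking a proposed answer takes one multiplication, while finding it takes thousands of steps.

```python
def check_factorization(n: int, p: int, q: int) -> bool:
    """Verifier: cheap, needs no insight into how p and q were found."""
    return 1 < p < n and 1 < q < n and p * q == n


def find_factorization(n: int):
    """Generator: naive trial division, returns (p, q, steps) or None."""
    steps = 0
    d = 2
    while d * d <= n:
        steps += 1
        if n % d == 0:
            return d, n // d, steps
        d += 1
    return None  # n is prime (or too small to factor)


n = 10007 * 10009  # product of two primes, so the search has no shortcut

p, q, steps = find_factorization(n)

print(p, q, steps)                         # the search takes thousands of steps
print(check_factorization(n, p, q))        # honest answer: accepted
print(check_factorization(n, 3, n // 3))   # 'cheating' answer: rejected
```

The verifier catches the bad answer without being able to produce the good one. That is the property one would want from a human checking an AGI’s math, and nothing here says it carries over to domains where the check itself is hard.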

There are plenty of more practical, more dangerous domains where that is also the case. Tons of problems are of the form ‘There was essentially zero hope that I would have generated this course of action, but now that you propose it I understand what it would do and why it is or isn’t a good idea.’

Nanotech and protein folding, which are used in the post as the canonical default unsafe things to do, seem like areas where this is not the case. There are plenty of times when by far the most efficient thing to do, if you trust the AGI, is not to check all the consequences of its output, and it is highly plausible that pivotal acts require trusting the AGI in this way for all solutions we have found so far. The existence of exceptions doesn’t ‘get us out’ of the core problem here, but it seems important to be precise.

29. The outputs of an AGI go through a huge, not-fully-known-to-us domain (the real world) before they have their real consequences. Human beings cannot inspect an AGI’s output to determine whether the consequences will be good.

Yes, obviously, for outputs that are sufficiently relevant to our interests here, and we can’t use the ones where we can know the consequences to know what would happen when we can’t. What we can potentially do with outputs is sometimes know what those particular outputs would do, at the cost of severe limitation, and also again we are reading outputs of an AGI which is a very bad idea if it isn’t aligned.

30. There is no pivotal output of an AGI that is humanly checkable and can be used to safely save the world but only after checking it; this is another form of pivotal weak act which does not exist.

This is the rub of the whole section. There exist outputs that are humanly checkable. There exist outputs that are humanly checkable but not in practice humanly generatable. The claim is that no combination of such outputs can enable a pivotal act.

If true, then performing a pivotal act requires trusting the AGI, which means we will have to trust the AGI, despite having no reason to think this would be anything but the worst possible idea and no path to making it otherwise.

It is clear that no one has figured out how to avoid this, or at least no one willing to talk about it, despite quite a bit of trying. It is highly plausible that there is no solution. I continue not to be convinced there exists no solution.

I also know that if I thought I had such an act, it is highly plausible I would take one look at it and say ‘I am not talking about that in public, absolutely not, no way in hell.’

31. A strategically aware intelligence can choose its visible outputs to have the consequence of deceiving you, including about such matters as whether the intelligence has acquired strategic awareness; you can’t rely on behavioral inspection to determine facts about an AI which that AI might want to deceive you about. (Including how smart it is, or whether it’s acquired strategic awareness.)

Yes, obviously. Same as a human, except (when it matters most) smarter about it. And anything internal you observe also becomes an output that it can do this on, as well.

32. Human thought partially exposes only a partially scrutable outer surface layer. Words only trace our real thoughts. Words are not an AGI-complete data representation in its native style. The underparts of human thought are not exposed for direct imitation learning and can’t be put in any dataset. This makes it hard and probably impossible to train a powerful system entirely on imitation of human words or other human-legible contents, which are only impoverished subsystems of human thoughts; unless that system is powerful enough to contain inner intelligences figuring out the humans, and at that point it is no longer really working as imitative human thought.

Yes, except perhaps for the last bit after the bold.

Humans themselves contain inner intelligences figuring out humans. Relative to other tasks we are remarkably good at this one. If your goal was to train a powerful system, and your method was to have the system do so on language while in some sense figuring out the humans, that doesn’t sound like it means you can’t be imitating human thought? Especially since if the goal was to imitate human words, you’d potentially want to be imitating the human interpretations of humans rather than correctly interpreting the humans, as the important thing, because you’re trying to model what a human would have done next in text and that requires knowing what words would bubble out of their system rather than understanding what’s actually going on around them.

33. The AI does not think like you do, the AI doesn’t have thoughts built up from the same concepts you use, it is utterly alien on a staggering scale. Nobody knows what the hell GPT-3 is thinking, not only because the matrices are opaque, but because the stuff within that opaque container is, very likely, incredibly alien – nothing that would translate well into comprehensible human thinking, even if we could see past the giant wall of floating-point numbers to what lay behind.

Yes. The AI does not think like you do, and 99% of people need to understand this.

But maybe it kind of does? For two reasons.

One is that, again based on my discussions with Chris Olah, and another discussion I had with someone else working on interpretability, to the extent that they did look inside a giant inscrutable matrix it turned out to be surprisingly scrutable, and many of the neurons ‘meant something.’ That’s not as helpful as one would hope, but it is an indication that some of the thinking isn’t alien for the larger values of alien. It’s still going to be more alien than any other humans are thinking, but the scale may not be so staggering in the end.

Which plays into the second reason, which is #22, the claim that there is a core function to general intelligence, which implies the possibility that in some sense we are Not So Different as all that. That’s compared to being completely alien and impossible to ever hope to decipher at all, mind you, not compared to obvious nonsense like ‘oh, you mean it’s like how it’s really hard to understand ancient Egyptians’ or something, yes it is going to be a lot, lot more alien than that.

I continue to be skeptical that getting a general intelligence is that easy, but if it is that easy and follows this naturally, I wonder how much that implies it is (relatively) less alien.

Section B.4

34. Coordination schemes between superintelligences are not things that humans can participate in (eg because humans can’t reason reliably about the code of superintelligences); a “multipolar” system of 20 superintelligences with different utility functions, plus humanity, has a natural and obvious equilibrium which looks like “the 20 superintelligences cooperate with each other but not with humanity”.

Yes. I am convinced that things like ‘oh we will be fine because the AGIs will want to establish proper rule of law’ or that we could somehow usefully be part of such deals are nonsense. I do think that the statement here on its own is unconvincing for someone not already convinced who isn’t inclined to be convinced. I agree with it because I was already convinced, but unlike many points that should be shorter this one should have probably been longer.

35. Schemes for playing “different” AIs off against each other stop working if those AIs advance to the point of being able to coordinate via reasoning about (probability distributions over) each others’ code. Any system of sufficiently intelligent agents can probably behave as a single agent, even if you imagine you’re playing them against each other.

Yes. Not automatically or definitely, but enough of a probably that these plans are non-starters even if they weren’t also non-starters for other reasons as well, which I think they are.

I at least somewhat remember when LessWrong was all about questions like this. That was a long time ago. A more civilized conversation from a more civilized age.

36. AI-boxing can only work on relatively weak AGIs; the human operators are not secure systems.

Yes, the only way to keep a sufficiently strong AGI boxed is to not interact with its output in any way, and even then I wouldn’t be so sure. Stop thinking there is hope here, everyone, please. Alas, my expectation is that the argument here is not going to be convincing to anyone who wasn’t already convinced by now.

Section C

Okay, those are some significant problems, but lots of progress is being made on solving them, right? There’s a whole field calling itself “AI Safety” and many major organizations are expressing Very Grave Concern about how “safe” and “ethical” they are?

Yeah, it’s not going so well. It is in fact going so incredibly poorly that so far the whole thing is quite plausibly vastly net negative, with most funding that has gone into “AI Safety” efforts serving as de facto capabilities research that both speeds things up and divides them and is only serving to get us killed faster. It is going so poorly that Eliezer is writing posts like this with actually no idea what useful things could be done, and when I asked people what could be done if one literally was directing policy for the President of the United States, I got essentially no useful suggestions beyond trying to hire away GPU designers (or AI researchers if you have the budget for that) to design solar panels. Which, sure, better than not doing that, but that is not a good answer.

37. There’s a pattern that’s played out quite often, over all the times the Earth has spun around the Sun, in which some bright-eyed young scientist, young engineer, young entrepreneur, proceeds in full bright-eyed optimism to challenge some problem that turns out to be really quite difficult. Very often the cynical old veterans of the field try to warn them about this, and the bright-eyed youngsters don’t listen, because, like, who wants to hear about all that stuff, they want to go solve the problem! Then this person gets beaten about the head with a slipper by reality as they find out that their brilliant speculative theory is wrong, it’s actually really hard to build the thing because it keeps breaking, and society isn’t as eager to adopt their clever innovation as they might’ve hoped, in a process which eventually produces a new cynical old veteran. Which, if not literally optimal, is I suppose a nice life cycle to nod along to in a nature-show sort of way.
Sometimes you do something for the first time and there are no cynical old veterans to warn anyone and people can be really optimistic about how it will go; eg the initial Dartmouth Summer Research Project on Artificial Intelligence in 1956: “An attempt will be made to find how to make machines use language, form abstractions and concepts, solve kinds of problems now reserved for humans, and improve themselves. We think that a significant advance can be made in one or more of these problems if a carefully selected group of scientists work on it together for a summer.”
This is less of a viable survival plan for your planet if the first major failure of the bright-eyed youngsters kills literally everyone before they can predictably get beaten about the head with the news that there were all sorts of unforeseen difficulties and reasons why things were hard. You don’t get any cynical old veterans, in this case, because everybody on Earth is dead.
Once you start to suspect you’re in that situation, you have to do the Bayesian thing and update now to the view you will predictably update to later: realize you’re in a situation of being that bright-eyed person who is going to encounter Unexpected Difficulties later and end up a cynical old veteran – or would be, except for the part where you’ll be dead along with everyone else. And become that cynical old veteran right away, before reality whaps you upside the head in the form of everybody dying and you not getting to learn.
Everyone else seems to feel that, so long as reality hasn’t whapped them upside the head yet and smacked them down with the actual difficulties, they’re free to go on living out the standard life-cycle and play out their role in the script and go on being bright-eyed youngsters; there’s no cynical old veterans to warn them otherwise, after all, and there’s no proof that everything won’t go beautifully easy and fine, given their bright-eyed total ignorance of what those later difficulties could be.

I mostly agree with the central thing that’s being got at here in the end, but I think a lot of this is a misunderstanding of the proper role of Bright-Eyed Youngsters, so I want to kind of reason through this again.

If all the problems in the world were conveniently labeled with difficulty levels, or could be so assessed by the number of cynical old veterans sitting in their offices continuing to not solve the problem while writing enough papers to have tenure, and the way one solved problems was to accumulate Valuable Experience and Score Difficulty Points until the solving threshold was reached, then it would make sense that the purpose of a Bright-Eyed Youngster is to get smacked upside the head enough times to create a Cynical Old Veteran (COV). At which point perhaps they can make some progress and we can all praise the cycle of life.

Instead, I think the way that it works is that the COVs mostly don’t solve such problems. Instead, the COVs are out of ideas of how to solve the problem, or have concluded the problem is hopeless, and write posts like Eliezer’s about why the problem is doomed to never be solved. And they spend some of their time mentoring Bright-Eyed Youngsters, explaining to them why their ideas won’t work and helping reality smack them upside the head more efficiently. When the youngster is actually on the right track, they often explain to them why their ideas are wrong anyway, and sometimes the youngster luckily does not listen. Also the veterans assign subproblems and determine who gets tenure.

Who actually solves problems? In general (not AGI specific) I am not going to bet on the Cynical Old Veterans too aggressively, especially the older and more cynical ones. Exactly how young or old to bet depends on the field – if AGI research is most similar to mathematics, presumably one should bet on quite young. If it’s other things, less young, but I’d assume still rather young.

You should update straight to ‘this particular problem of building an AGI is super difficult’ without requiring failed attempts, through reasoning out the nature of the problem, but my hunch is you want to in some senses remain a Bright-Eyed Youngster anyway.

The bright-eyed thing is a feature (and the young thing is definitely a feature), because they make people actually try to solve problems for real. Most people don’t react to learning that AGI is as hard as it is (if they do ever learn that) by saying ‘all right, time to score as many dignity points as possible and work on the actually hard parts of this problem.’ Instead they either find a way to unlearn the thing as quietly and quickly as possible, or they ignore it and keep publishing, or they go do something else, or they despair. That’s typical: if you tell me a problem is impossible, chances are I’ll find something else to do or start doing fake work. A response of ‘yes this is an impossible problem but I’ll solve it anyway’ seems great.

The structure implies any given unsolved problem is hard, including for new problems. Which doesn’t seem right in general – this particular problem is indeed hard but many unsolved problems seem hard to COVs but are easy in the face of an actual attempt. Often when you start on a new problem it turns out it really is easy, because there’s no selection against it being easy. Many problems turn out to be shockingly easy in the face of a real attempt. It is exactly the youngsters who think the problem is easy because they see something unique about it that are most likely to actually solve it, even though they’re still presumably not realizing how hard it is, the same way that start-up founders usually have no idea what they’re signing up for but also that’s how they actually found start-ups. Which, when they work, then proceed to use reality to slap the COVs upside the head on the way out. Or science can advance one funeral at a time.

The difference here is that a Bright-Eyed Youngster (BEY) working on most problems will waste some resources but doesn’t do much real harm. In AGI there’s the danger they will literally kill everyone on the planet. That’s new.

So far they haven’t killed everyone, but BEYs are also failing to turn into skilled COVs because they don’t even have the opportunity to properly fail (and kill everyone).

This does require some adjustments, especially once a BEY could potentially build an AGI. There’s some confusion here over whether the BEY is thinking they know how to do safety versus thinking they know how to do an AGI at all (the most BEY of the BEYs don’t even realize safety is a problem), but mostly this still should refer to safety. At which point, yes, you very much don’t want to trust that BEY’s safety idea, and if they want to succeed at safety they need to be able to do it without being told by reality that their first few answers were hopelessly naïve.

This could be an argument that you want to use more veteran people, who have a relatively bigger sense of these issues. They have a better relative chance to actually solve the problem in this situation. Failure to previously solve it isn’t evidence against them, because up until then the problem won’t have been something that could potentially be solved, and error correction is relatively important. When I became a Cynical Old Veteran of Magic: The Gathering, I was much better about getting things right on the first try than I used to be, while simultaneously being worse at truly innovating. Which may or may not be the trade-off you need.

The report is that true worthwhile COVs (other than Eliezer) don’t exist: there’s no one else sitting around not pretending to do fake things but happy to teach you exactly why you’ll fail. Or so the report goes.

The Bayesian point stands. Ideally a BEY should update on a problem not having been solved despite much effort and conclude it is likely very hard, and not hide from all the particular things that need to be dealt with, yet continue to have the enthusiasm to work on the problem while behaving in useful ways as if the problem will turn out to be easy for them in particular for some reason, if by ‘easy’ we mean just barely solvable, without actually believing that they will solve it.

Everyone being killed on the first attempt to solve the problem doesn’t tell you the difficulty level of the problem aside from the fact that the first failed attempt kills everyone. This seems like it goes double if in order to try and solve the problem you first need to solve another problem that is just now becoming solvable, since you can’t have a safe AGI without a way to make an AGI to begin with. So you have to think about the problem and figure it out that way.

So yes, young warrior, you must forge a Sword of Good Enough and take it into the Dungeon of Ultimate Evil and find your way to the evil wizard and slay him. But if you take an actual Sword of Good Enough in and the wizard gets it, that’s it, everyone dies, world over. It’s probably going to involve overwhelming odds against you, I mean did you see the sign above the dungeon or hear the screams inside, things look pretty grim, but our evidence is based on reasoning out what is logically going to be in this high level a dungeon, because we’ve never had anyone run into the dungeon with an actual Sword of Good Enough and get smacked upside the head by reality, and we know this because if they had we’d all be dead now.

And you can’t wait forever, because there are plenty of other people who think they’re heroes in a video game with save points and are going to try and speed run the damn thing, and it won’t be that long before one of them figures out how to forge a sword and gets us all killed, so ‘grind an absurd amount before entering’ means you never get a chance at all.

If there were a bunch of dead heroes to point to and people who ran away screaming to save their lives, then you could say ‘oh I guess I should update that this dungeon is pretty tough’ but without them the others get to fool themselves into thinking it might be that easy, and if it is then getting there late won’t get them the glory.

I remember starting my own start-up as a BEY (except founder, not researcher), noticing the skulls, and thinking the problem was almost certainly incredibly hard and also probably much harder than I thought it was (though I expected the gap between my estimates and reality to be much smaller than it is for most founders, and I think this proved true, although our particular idea was bad and therefore unusually hard). I also thought: so what, I had odds, let’s do this anyway. Then I went out and did it again as more of a hybrid, with a better idea that was relatively easier, but on the same principle. That doesn’t apply here, because there were no attempts that went anywhere at all, even at fully unsafe AGIs, and thus no failures or successes, resulting in zero successes but also zero veterans and zero skulls.

The problem comes from the BEY getting us all killed, by actually attempting to win the game via a half-baked solution that has zero chance of working on multiple levels, in a way that would normally not matter but here is deadly because an AGI is involved. And sure, point taken, but as long as that’s not involved what’s the problem with BEYs going in and boldly working on new safety models only to have reality smack them upside the face a lot?

My Eliezer model says that what’s wrong with that is that this causes them to do fake research, in the sense that it isn’t actually trying to solve the problem slash has zero chance of being helpful except insofar as it has a chance of teaching them enough to turn them into cynical veterans, and there isn’t enough feedback to make them into veterans because reality isn’t going to smack them upside the head strongly enough until it actually kills everyone.

And also the problem that most things people tell themselves are safety work are actually capability work and thus if you are not actually doing the hard safety work you are far more likely to advance capability and make things worse than you are to have some amazing breakthrough.

Or even worse, the problem is that the BEYs will actually succeed at a fake version of the alignment problem, producing something that looks enough like it would work that they think they’ve solved it and are willing to turn on an AGI.

Thus, what you actually need is a BEY who is aware of why the problem is impossible (in the shut up and do the impossible sense) and thus starts work on the real problems, and everyone else is far worse than worthless because of what we know about the shape of the problem and how people interact with it and what feedback it gives us – assuming that our beliefs on this are correct, and I say ‘our’ because I mostly think Eliezer is right.

Notice the implications here. If the premises here are correct, and I believe they probably are, they seem to imply that ‘growing the field’ of AI Safety, or general ‘raising awareness’ of AI Safety, is quite likely to be an actively bad idea, unless they lead to things that will help, which means either (A) people who actually get what they’re facing and/or (B) people who try to stop or slow down AGI development rather than trying to make it safer.

38. It does not appear to me that the field of ‘AI safety’ is currently being remotely productive on tackling its enormous lethal problems. These problems are in fact out of reach; the contemporary field of AI safety has been selected to contain people who go to work in that field anyways. Almost all of them are there to tackle problems on which they can appear to succeed and publish a paper claiming success; if they can do that and get funded, why would they embark on a much more unpleasant project of trying something harder that they’ll fail at, just so the human species can die with marginally more dignity? This field is not making real progress and does not have a recognition function to distinguish real progress if it took place. You could pump a billion dollars into it and it would produce mostly noise to drown out what little progress was being made elsewhere.

Yes, and again, it seems like this is not saying the quiet part out loud. The quiet part is ‘I say not being productive on tackling lethal problems but what I actually meant is they are making our lethal problems worse by accelerating them along and letting people fool themselves about the lethality of those problems, so until we have a better idea please stop.’

39. I figured this stuff out using the null string as input, and frankly, I have a hard time myself feeling hopeful about getting real alignment work out of somebody who previously sat around waiting for somebody else to input a persuasive argument into them. This ability to “notice lethal difficulties without Eliezer Yudkowsky arguing you into noticing them” currently is an opaque piece of cognitive machinery to me, I do not know how to train it into others. It probably relates to ‘security mindset’, and a mental motion where you refuse to play out scripts, and being able to operate in a field that’s in a state of chaos.

Security mindset seems highly related, and the training thing here seems like it shouldn’t be that hard? Certainly it seems very easy compared to the problem the trained people will then need to solve, and I think Eliezer has de facto trained me a substantial amount in this skill through examples over the years. There was a time I didn’t have security mindset at all, and now I have at least some such mindset, and some ability to recognize lethal issues others are missing. He doesn’t say how many other people he knows who have the abilities referred to here, or whether he knows anyone who has acquired them over time. I’d be curious about both.

If the class ‘AI researcher without this mindset’ is net negative, and one with it is net positive, then we need to get CFAR and/or others on the case. This problem seems more like ‘not that many people have made a serious attempt and it seems quite likely to be not impossible’ than ‘this seems impossible.’

If nothing else, a substantial number of other people do have security mindset, and you can presumably find them by looking at people who work in security, and presumably a bunch of them have thought about how to teach it?

40. “Geniuses” with nice legible accomplishments in fields with tight feedback loops where it’s easy to determine which results are good or bad right away, and so validate that this person is a genius, are (a) people who might not be able to do equally great work away from tight feedback loops, (b) people who chose a field where their genius would be nicely legible even if that maybe wasn’t the place where humanity most needed a genius, and (c) probably don’t have the mysterious gears simply because they’re rare.
You cannot just pay $5 million apiece to a bunch of legible geniuses from other fields and expect to get great alignment work out of them.
They probably do not know where the real difficulties are, they probably do not understand what needs to be done, they cannot tell the difference between good and bad work, and the funders also can’t tell without me standing over their shoulders evaluating everything, which I do not have the physical stamina to do.
I concede that real high-powered talents, especially if they’re still in their 20s, genuinely interested, and have done their reading, are people who, yeah, fine, have higher probabilities of making core contributions than a random bloke off the street. But I’d have more hope – not significant hope, but more hope – in separating the concerns of (a) credibly promising to pay big money retrospectively for good work to anyone who produces it, and (b) venturing prospective payments to somebody who is predicted to maybe produce good work later.

The problem with promising to pay big money retrospectively for good work is that, while an excellent idea, it doesn’t actually solve the motivation problem if the problem with getting ‘good work’ out of people is that the probability of success for ‘good work’ is very low.

Which is indeed the problem, as I understand Eliezer describing it and I think he’s largely right. Someone who enters the field who chooses to do real work has to recognize the need for ‘real’ work (he calls it ‘good’ above, sure), know what real work is and how to do it, and choose to attempt real work despite knowing that the default outcome that probably happens is that no good work results and thus the payoff is zero.

That is, unless there is some way to recognize a real failed attempt to do real work and reward that, but we don’t have a hopeful path for accurately doing that without actual Eliezer doing it, for which the stamina is unavailable.

The question then is, sure, paying the $5 million isn’t super likely to get good work out of any individual person. But it’s at least kind of true that we have billions of dollars that want to be put to work on AI Safety, that aren’t being spent because their holders can’t help but notice that spending more money on current AI Safety options isn’t going to generate positive amounts of dignity, and in fact likely generates negative amounts.

The real potential advantage of the $5-million-to-the-genius approach is not that the genius is a favorite to do useful work. The advantage is that if you select such people based on them understanding the true difficulty of the problem, which is reinforced by the willingness to cut them the very large check and also the individual attention paid to them before and after check writing to ensure they ‘get it,’ they may be likely to first, do no harm. It seems plausible, at least, that they would ‘fail with dignity’ when they inevitably fail, in ways that don’t make the situation worse, because they are smart enough to at least not do that.

So you could be in a situation where paying 25 people $200k each ends up being worse than doing nothing, while paying one promising genius $5 million is at least better than doing nothing. And given the value of money versus the value of safety work, it’s a reasonable approximation to say that anything with positive value is worth spending a lot of money on. If the bandwidth required has rival uses that’s another cost, but right now the alternative uses might be things we are happy to stop.
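To make that comparison concrete, here is a minimal sketch in Python. The two plans cost the same $5 million, and the only thing that differs is the sign of the expected contribution per person. The per-person ‘dignity’ numbers are made-up placeholders chosen purely to illustrate the shape of the argument; they are assumptions, not estimates of anything.

```python
# Toy comparison of two ways to spend the same budget.
# All "value" figures are hypothetical placeholders, not real estimates.

def plan_cost(plan):
    """Total dollars spent: headcount times payment per person."""
    return plan["people"] * plan["pay_each"]

def plan_value(plan):
    """Total expected contribution, in arbitrary 'dignity' units."""
    return plan["people"] * plan["value_each"]

# 25 researchers at $200k each, each assumed to be slightly net negative
# (for example, by drifting into capabilities work).
many_small = {"people": 25, "pay_each": 200_000, "value_each": -1.0}

# One carefully selected genius at $5 million, assumed to be slightly
# net positive because they at least 'first, do no harm'.
one_large = {"people": 1, "pay_each": 5_000_000, "value_each": 0.5}

for name, plan in [("25 x $200k", many_small), ("1 x $5M", one_large)]:
    print(name, "cost:", plan_cost(plan), "value:", plan_value(plan))
```

Under these assumed signs the cheaper-per-head plan is worse than doing nothing while the expensive one is better than nothing, even though both spend the same amount. The whole question is whether the signs are right, which is the crux discussed next.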

Another theory, of course, is that introducing a genius to the questions surrounding AGI is a deeply, deeply foolish thing to be doing. Their genius won’t obviously transfer to knowing not to end up doing capabilities work or accidentally having (and sharing) good capabilities ideas, so the last thing you want to do is take the most capable people in the world at figuring things out and have them figure out the thing you least want anyone to figure out.

As far as I can tell, that’s the real crux here, and I don’t know which side of it is right?

41. Reading this document cannot make somebody a core alignment researcher. That requires, not the ability to read this document and nod along with it, but the ability to spontaneously write it from scratch without anybody else prompting you; that is what makes somebody a peer of its author. It’s guaranteed that some of my analysis is mistaken, though not necessarily in a hopeful direction.
The ability to do new basic work noticing and fixing those flaws is the same ability as the ability to write this document before I published it, which nobody apparently did, despite my having had other things to do than write this up for the last five years or so.
Some of that silence may, possibly, optimistically, be due to nobody else in this field having the ability to write things comprehensibly – such that somebody out there had the knowledge to write all of this themselves, if they could only have written it up, but they couldn’t write, so didn’t try. I’m not particularly hopeful of this turning out to be true in real life, but I suppose it’s one possible place for a “positive model violation” (miracle). The fact that, twenty-one years into my entering this death game, seven years into other EAs noticing the death game, and two years into even normies starting to notice the death game, it is still Eliezer Yudkowsky writing up this list, says that humanity still has only one gamepiece that can do that. I knew I did not actually have the physical stamina to be a star researcher, I tried really really hard to replace myself before my health deteriorated further, and yet here I am writing this.
That’s not what surviving worlds look like.

Yes, mostly. A lot of distinct claims to unpack here, which is why it is quoted in full.

Reading this document is different from being able to understand and recreate the arguments, or the ability to generate additional similar arguments on things that weren’t mentioned or in response to new objections or ideas.

The bolder claim is the idea that if you couldn’t have written something similar to this document yourself, you can’t usefully research AI Safety.

(Notice once again that this is saying that almost no one can usefully research AI Safety and that we’d likely be better off if most of the people doing so stopped trying, or at least/most worked on first becoming able to generate such a document rather than directly on the problem.)

On the question of writing ability?

I will say outright that yes, that is an important barrier here.

Consider the chance that any given person who could otherwise have generated the list lacks the required writing ability. Writing ability on the level of Eliezer isn’t as rare as understanding of the problem on the level of Eliezer, but it is quite rare. How many people would have a similar chance to Eliezer of ‘pulling off’ HPMOR or the sequences purely in terms of writing quality, even if they understood the core material about as well?

Writing the list in this way is a thing Eliezer gets to do that others mostly don’t get to do. If someone else wrote up the list with this level of ranting and contempt, I would not expect that to go well, and that would reasonably lead someone else capable of writing it that way to not do so.

The job of someone else writing this list properly is much harder. They would feel the need to write it ‘better’ in some ways which would make it longer, and also probably make it worse for at least several iterations. The job of deciding to write it is much harder, requiring the author to get past a bunch of social barriers and modesty issues and so on. At best it would not be a fast undertaking.

One could reasonably argue that there’s a strong anti-correlation in skills here. How do you get good at writing? You write. A lot. All the time. There are no substitutions. And that’s a big time commitment.

So how many people in the broad AI Safety community have written enough words in the right forms to plausibly have the required writing ability here even in theory? There are at most a handful.

And of course, writing such a list is not a normal default social action so it doesn’t happen, and even Eliezer took forever to actually write the list and post it, and ended up deciding to post a self-described subpar version for want of ability to write a good one, despite knowing how important such a thing was and having all the required knowledge.

That does not mean there are people who, if imbued with the writing skill, could have written the list. It simply means we don’t have the Bayesian evidence to know.

I agree that, in the cases where Eliezer is right about the nearness and directness of the path to AGI, this is mostly not what surviving worlds look like, but also I’ve learned that everyone everywhere is basically incompetent at everything and also not trying to do it in the first place, and yet here we still are, so let’s not despair too much every time we get that prior confirmed again. If you had told me ten years ago a lot of the things I know now, I’d have also said ‘that’s not what surviving civilizations look like’ purely in terms of ordinary ruin.

42. There’s no plan. Surviving worlds, by this point, and in fact several decades earlier, have a plan for how to survive. It is a written plan. The plan is not secret. In this non-surviving world, there are no candidate plans that do not immediately fall to Eliezer instantly pointing at the giant visible gaping holes in that plan. Or if you don’t know who Eliezer is, you don’t even realize you need a plan, because, like, how would a human being possibly realize that without Eliezer yelling at them?

Yes, there is no plan. I would like to have a plan. Not having any plan at all, of any detail, that offers a path forward, is indeed not what surviving worlds usually look like.

Yet I am not convinced that surviving worlds involve a plan along the lines above.

You know who else doesn’t have a plan that Eliezer (you’d think I would say whoever the domain-equivalent of Eliezer is and that would work too, but honestly literal Eliezer would mostly work fine anyway) couldn’t point at the visible gaping holes in?

Yeah, with notably rare exceptions the answer is actual everyone else.

I do realize that the whole point is that the kind of complete incompetence and muddling through via trial and error we usually do won’t work on this one, so that offers little comfort in some sense, but the visible written plan that actually works available decades in advance is not how humans work. If anything, this feels like one of those reality-falsifying assumptions Eliezer is (wisely) warning everyone else not to make about other aspects of the problem, in the sense that this is trying to make the solution run through a plan like that which kind of is like assuming such a plan could possibly exist. Which in turn seems like it is either a very bold claim about the nature of humanity and planning, the nature of the problem and solution space (in a way that goes in a very different direction than the rest of the list), or more likely both.

This document wasn’t written until well after it could have been written by Eliezer. Part of that is health issues, but also part of that clearly is that we wasted a bunch of time thinking we’d be able to offer better ideas and better plans and thus didn’t proceed with worse slash less ready ideas and plans as best we could. The new plan of not holding out as much for a better plan is indeed a better, if highly non-ideal, plan.

Relatively few are aware even that they should, to look better, produce a pretend plan that can fool EAs too ‘modest’ to trust their own judgments about seemingly gaping holes in what serious-looking people apparently believe.

Is this right? Should I have produced a pretend plan? Should I be pretending to write one here and now? Actually writing a bad one? How many people should have produced one? Do we want to look better?

If everyone is being overly modest (and mostly they are) then there’s also a big danger of information cascades during this kind of creation of common knowledge. Everyone converging around our failure to make any progress and the situation being grim seems clearly right to me. Everyone converging around many other aspects of the problem space worries me more as I am not convinced by the arguments.

43. This situation you see when you look around you is not what a surviving world looks like. The worlds of humanity that survive have plans. They are not leaving to one tired guy with health problems the entire responsibility of pointing out real and lethal problems proactively. Key people are taking internal and real responsibility for finding flaws in their own plans, instead of considering it their job to propose solutions and somebody else’s job to prove those solutions wrong. That world started trying to solve their important lethal problems earlier than this. Half the people going into string theory shifted into AI alignment instead and made real progress there. When people suggest a planetarily-lethal problem that might materialize later – there’s a lot of people suggesting those, in the worlds destined to live, and they don’t have a special status in the field, it’s just what normal geniuses there do – they’re met with either solution plans or a reason why that shouldn’t happen, not an uncomfortable shrug and ‘How can you be sure that will happen’ / ‘There’s no way you could be sure of that now, we’ll have to wait on experimental evidence.’
A lot of those better worlds will die anyways. It’s a genuinely difficult problem, to solve something like that on your first try. But they’ll die with more dignity than this.

I go back and forth on what my relationship should be to the problem of AI Safety, and what the plan should be to address it both on a personal and general strategic level. I’ve come around largely to the perspective that my comparative advantage mostly lies elsewhere, and that many other aspects of our situation are both threatening to doom us even without or before AGI dooms us and also even their lesser consequences are why our world looks like it does (as in: not like one that is that likely to survive AGI when it happens). So it makes sense for me to mostly work on making the world/civilization more generally look like one that gets to survive in many ways, rather than directly attack the problem.

At other times I get to wondering if maybe I should try to tackle the problem directly based on having been able to usefully attempt tackling of problems I should have had no business attempting to tackle. I do reasonably often get the sense that these problems have solutions and with the right partnerships and resources I could be able to have a chance of finding them. Who knows.

Conclusion

I put the core conclusions at the top rather than the bottom, on the theory that many/most people quite reasonably won’t read this far. I was on the fence, before being funded to do it, on whether writing this was a good idea given my current level of domain knowledge and the risk of wasting not only my own but other people’s time. Having written it, it seems like it is plausibly useful, so hopefully that turns out to be right. There are various secondary documents that could be produced that require a combination of writing skill and understanding of the problem and willingness to go ahead and write drafts of them, and it is not crazy that I might be the least terrible solution to that problem for some of them.

58 Responses to On A List of Lethalities

Thegnskald says:

June 13, 2022 at 5:34 pm

An observation: Younger Eliezer Yudkowsky wanted to build a God-AI that would fix all the problems in the world. Older Eliezer Yudkowsky says he realized this was dangerous. Let’s grant that this is dangerous.

But I can’t help but notice that Older Eliezer Yudkowsky has carefully framed the problem such that the only solution to the danger is the thing Younger Eliezer Yudkowsky wanted to build in the first place, and has gotten a lot of other smart people to start working on the problem. And in a significant sense, OEY’s article isn’t “Stop working on AI”, it’s “Hurry up and build the God-AI! We’re running out of time!”

maline says:

June 13, 2022 at 7:30 pm

It seems odd to worry that we might make an aligned AGI only for some competing group to make an unaligned one a bit later. Doesn’t instrumental convergence imply that any agentic system will do its best to prevent the creation of powerful competitors?
In particular, of course it will destroy any records of whatever insight led to its own creation. And kill the humans who know too much, assuming it has consequentialist ethics.

- TheZvi says:
  
  June 13, 2022 at 8:34 pm
  
  The worry is not that you create an aligned AGI and then they create an unaligned one, so much as you could have gotten an unaligned AGI, you slowed down to align it, so they got there first and prevented yours.
  
  - Humphrey Appleby says:
    
    June 13, 2022 at 9:33 pm
    
    I think there are some additional assumptions here that have not been spelled out. (1) Fast takeoff: I might otherwise imagine that the world in which we first create a human level AGI will be a world where there already exist lots of barely subhuman AGIs, that the world in which a barely superhuman AGI first appears will also contain lots of human level AGIs etc. (2) Recursive self improvement (I guess this ties into (1)). I could imagine that maybe this doesn’t happen, or that it is subject to strongly diminishing returns. (3) That arbitrary intelligence confers arbitrary power, as opposed to being subject to diminishing returns. (4) That an arbitrarily smart AI is not bottlenecked by physical world constraints (like the time and resources required to actually run experiments and collect data). Or indeed by constraints from e.g. computational complexity theory which might render certain problems insoluble even for arbitrarily smart AIs…
    
    It might be that Eliezer has addressed all these assumptions but I have zero intention of actually reading Eliezer. I might read *you* though, if you have thought about them.
    
    - J.S. Bangs says:
      
      June 14, 2022 at 8:51 am
      
      I concur with several of these. In particular, the LW dialogue about unfriendly AI strongly assumes, over and over, that (1) a sufficiently intelligent AI is not constrained by physical limits of time, energy, or materials, at least not at human-relevant scales and (2) that a sufficiently intelligent AI can be infinitely persuasive in a way which is functionally equivalent to mind control. I believe both of these things to be false. I have read some of the counter-arguments, and they strike me as exceptionally weak, and most writing about AI never even attempts to justify these assumptions.
      
      The fact that Yud goes back to his nanotech example over and over is illustrative: in order to make his scenario of “total, immediate extermination” work out he has to posit the existence of this entire other technology of which no examples exist and which is most likely physically impossible. Our host Zvi claims that this particular example doesn’t matter, but it does, because as soon as you replace this means of takeover with something more plausible you realize that the AI has to bootstrap itself in a way that gives human observers plenty of time to notice and react.
      
      On a related note, the entire vibe of “I have received an Insight which I cannot articulate, but which the Elect will intuitively recognize as true” is incredibly self-discrediting.
    - TheZvi says:
      
      June 14, 2022 at 12:35 pm
      
      I don’t have anything unique to say about these kinds of objections. I could think hard about how to better explain around these particular objections/disagreements but there’s been a lot of effort on that already so I’m skeptical of my ability to convince here. Most of the things I could say likely would be wrong in the sense that they don’t address your actual disagreement. It’s HARD for me to figure out how to build a plausible model where these things aren’t true because they seem so overdetermined.
      
      I also don’t think you need much in the way of such things to make the situation roughly the same anyway, for practical purposes, but shrug.
      
      I could of course be convinced to try anyway, but for now I think it’s not worth it.
    - Humphrey Appleby says:
      
      June 14, 2022 at 5:56 pm
      
      Zvi: at a minimum I think these should be stated as additional assumptions, not assumed as facts. You might think that they are overdetermined to be true (I disagree, I am uncertain as to their truth value), but at any rate they are not ‘self-evidently true,’ even if they are regarded as such on LessWrong. As a case in point, at least two of your readers disagree! Cards on the table, I think the ‘Rationalist community’ is suffering from a bad case of group-think when it comes to AI-Risk, so I don’t regard ‘this is the LessWrong consensus’ as a remotely persuasive reason to believe that such things are, in fact, true.
    - TheZvi says:
      
      June 14, 2022 at 7:03 pm
      
      I am not claiming you should believe because consensus. You shouldn’t, not unless you buy the reasoning. I’m saying that I don’t know a faster/better way to explain how to actually get to these conclusions without thinking hard about it.
    - greg kai says:
      
      June 15, 2022 at 11:11 am
      
      I agree with that. I should try to read the rebuttals, but the discussions on AGI seem to rely heavily on analogies like human/animal or adult/toddler. I think those analogies largely come out of the assumption that a super-human AGI would be at least partially inscrutable and produce magical solutions/tech, leaving the analogies as the only prediction tool remaining…
      I find this inscrutability assumption convincing. However, those analogies not only have the usual weaknesses of their kind (they are by definition imperfect, so of unclear predictive power), but worse: they do not necessarily point in the direction of an unstoppable AGI once it’s far enough above human level. After all, humans are certainly not invulnerable or even especially resilient (individually) with respect to other living organisms or simple accidents. The same goes for IQ within the human population: it gives you a resilience and power boost, but in a limited way, and it does not free you from the influence of lower-IQ individuals. Society is only very partially IQ-stratified.
      So the claim of becoming invulnerable at sufficient intelligence really needs a more convincing explanation than the usual analogies…
- myst_05 says:
  
  June 14, 2022 at 10:43 am
  
  If the first AGI is truly aligned, it might be too benevolent to actively harm the competition.
  
Crotchety Crank says:

June 13, 2022 at 10:22 pm

I’m confused why Eliezer decided to write this list.
He thinks just about everybody who has noticed the problem has made the problem worse. (38)
He thinks that someone who needs to read his list to notice the problem and its urgency is not someone capable of solving the problem. Rather, we need someone who “could write this list from the null string.” (39)
Taking these ideas seriously, and taking them together, why would he write this list sounding the alarm? The people who heed it will likely damage our prospects, and the people who won’t damage our prospects don’t need to heed it.

Gres says:

June 14, 2022 at 1:58 am

About point 30, on the nonexistence of a humanly-checkable pivotal act to protect us from AGI, it seems like a sufficient act would be a nuclear war that destroys all internet providers, all computer manufacturing plants, and all supercomputers large enough to run an AGI without the internet. It seems possible to do this without killing all humans.

This isn’t the only option. The pivotal act could involve humans building non-superintelligent killer drones to destroy GPUs. I don’t know enough nanotechnology to know if this is possible, but if nanotechnology generally either reacts predictably or is destroyed by unusual chemicals, then the AGI could develop a simple nanofactory for its purposes, and explain enough nanotechnology to humans that they could verify how the nanofactory would work.

- magic9mushroom says:
  
  June 14, 2022 at 1:41 pm
  
  Your suggestion is not a plan falsifying #30, or indeed any of Eliezer’s statements. It *would* falsify Zvi’s M2 (were it sound logic*), which is importantly different from what Eliezer said.
  
  Eliezer said in #6 that all known pivotal acts are outside the Overton Window. Eliezer said in #7 that there is no pivotal act that you can do with a non-dangerous-via-weakness AGI that you can’t do right now without AGI at all. Eliezer said in #30 that any idea that we need an AGI to generate will not be human-checkable.
  
  Notice the hole there: he never said anything about whether there are known pivotal acts outside the Overton Window that you can do right now without needing an AGI. Zvi said that there aren’t in M2, but Eliezer didn’t.
  
  People at LW have also noticed this hole (I’m kind of shocked that Zvi didn’t). Not sure if they’ve noticed any of the plans in that category, although I’m almost certain Eliezer knows at least the one I know (soft errors) and he may know others (or a reason soft errors are no good). Eliezer said in “Death with Dignity” that he doesn’t plan to do any of these. I would personally rather hope for a miracle than do soft errors.
  
  *Unilateral pivotal acts have to be practically irreversible to work. Factories can be rebuilt.
  
  - TheZvi says:
    
    June 14, 2022 at 2:51 pm
    
    Pretty sure I do mention this pretty explicitly in the detail section (e.g. the talk about how the dangerous cognition can be human).
    
    - magic9mushroom says:
      
      June 14, 2022 at 5:25 pm
      
      “Eliezer said you need an AGI for pivotal acts and I think he could be wrong” is different from “Eliezer conspicuously avoided saying you need an AGI for pivotal acts”.
      
      The latter is the case, and that’s what I’m saying at least one LWer noticed in the comments (Daphne_W) and you surprisingly missed (your M2, regardless of whether it is true, is importantly different from what Eliezer said; notice the caveat “- and yet also we can’t just go do that right now and need to wait on AI -” as an attribute of the nonexistent thing in #7).
      
      I thought that given the gravity (of both type I and type II errors) this oversight was worth addressing. I also thought you personally should be clued into the specific idea of soft errors (which, again, I’m pretty sure Eliezer also knows about, whether or not he’s acting on it; his silly toy example would be hard to generate without noticing it).
Nicholas Weininger says:

June 14, 2022 at 5:13 am

(disclaimer: I am very much an outsider to this larger conversation so maybe this has all been hashed out endlessly already among the insiders, sorry for redundancy if so)

Seems to me you’re burying the lede by not putting the observation from #39 about security mindset at the top. Is there in fact substantial overlap between the sort of people now working in AI safety and the sort of people who work for the NSA, 8200, the highest-impact security teams in the big tech companies, etc, and/or who write textbooks on network security, cryptography, etc? If not, why not? Why even try to select for any skillset other than that? To take an obvious-to-normies (maybe you’ll laugh at this) example: why hasn’t Eliezer convinced Bruce Schneier himself to concentrate on AI safety?

- JiSK says:
  
  June 14, 2022 at 10:20 pm
  
  Security Mindset is necessary but not sufficient.
  
Anonymous-backtick says:

June 14, 2022 at 4:53 pm

Yes, alignment is wicked hard. You know what doesn’t seem *as* hard? Convincing Facebook et al to shut down their AGI programs to buy more time to solve alignment.

I’m on the outside looking in, but my impression from Eliezer’s complaints about how he’s never been able to convince who he needs to convince is… the classic nerd thing of thinking of charisma as a dump stat. Like, he wanted the answer to the problem to be “I am going to write a really good technical paper and present it to academics” and not “I am going to take very seriously the problem of convincing specific executives to change their minds”.

You guys have Elon Musk and Peter Thiel’s ears and some of their agreement, even if they do frequently say things (or at least Musk does) that reveal they don’t actually understand the problem at all. Leverage that into the meetings you need. Hire actual 99.9-percentile negotiators and spend a few months taking very seriously the problem of designing and workshopping your pitch tree.

Fuck, if Eliezer is telling the truth about his secret AI box successes, maybe he even has the skills to do this himself if he actually tries. But either way, ACTUALLY TRY.

- TheZvi says:
  
  June 14, 2022 at 7:04 pm
  
  Trying to convince Elon Musk resulted in OpenAI. It’s not that attempts were not made.
  
  Not saying more and better attempts shouldn’t be made but this isn’t the free action you think it is.
  
  - Anonymous-backtick says:
    
    June 14, 2022 at 7:28 pm
    
    NO, if you think I’m saying “free action” there’s a big communication gap. Very difficult action that would be one of the greatest achievements of the century, just less difficult than motherfucking alignment.
    
    - Anonymous-backtick says:
      
      June 14, 2022 at 7:30 pm
      
      Like, isn’t this almost a necessary step, if alignment looks as far-out as it currently does? Don’t you have to knock down a domino like Facebook to have any chance of getting e.g. shadow government projects to take a step back and realize they could be committing suicide by continuing their work?
    - TheZvi says:
      
      June 14, 2022 at 8:37 pm
      
      I didn’t mean free action in terms of zero effort, I meant free action in terms of actively backfiring – arguably reaching out to Elon Musk in particular did more damage than all AI Safety efforts have done good, combined, on its own. Clearly it’s not free in terms of the effort required.
      
      And again, I do think it’s worth trying to shut down Facebook AI Research if there is a way to do that, but also I’m not the one who should do that.
  - kronopath says:
    
    September 18, 2022 at 6:25 am
    
    The catch here isn’t to convince people “AGI is a big threat”, it’s to convince people of things like:
    
    – Basic-but-important fundamental ideas like orthogonality and instrumental convergence
    – Any big AI should be seen as a risk (and therefore we have to keep them less big for our safety, maybe hard cap on the amount of compute used for training or running any AI?)
    – We should negotiate with other parties to reduce that risk (think nuclear disarmament)
    
dtsund says:

June 14, 2022 at 9:33 pm

Something that occurred to me while looking at the original post, and has sort of crystallized over the following days, is that it might actually be possible to do a “half dry run” to see if a prospective AI model/goal system might kill everyone.

Consider: The AI is a classical brain-in-a-vat. Its entire universe, to the best of its knowledge, is its training data, its sensorium, and the outputs it is capable of producing that (directly or indirectly) influence its sensorium. There’s no reason why any of these inputs or outputs need to be based in the real, physical world; alignment researchers interested in testing an AI model and goal system could train it on and release it into a simplified toy universe, with laws of nature on the order of complexity of, say, Minecraft, populated by goal-seeking agents of sub-human intelligence, with a clear path to omnipotence built in which the other agents can conceivably thwart if they become aware of it.

There are obvious problems here; an AI that might be safe in the toy universe could still be dangerous in the real one. I wouldn’t completely rule out the possibility that the AI could figure out that it’s in a toy universe contained as a simulation within the real one, and adjust its behavior accordingly (if this issue arises, it strikes me as likely a problem with the training data set, but I also recognize that devising a large enough data set while also curating it against this possibility may be difficult). And this clearly doesn’t solve the “Facebook gets there two years later anyway” issue. Nevertheless, it strikes me as a potentially useful tool to make alignment marginally more likely.

- Humphrey Appleby says:
  
  June 14, 2022 at 9:45 pm
  
  Is Facebook being used as a metonym here, or does the LessWrong universe believe that Facebook is uniquely bad/reckless?
  
  - TheZvi says:
    
    June 14, 2022 at 9:47 pm
    
    Facebook is on record as being committed to being maximally bad.
    
    - Humphrey Appleby says:
      
      June 14, 2022 at 10:21 pm
      
      Can you clarify? What are they on record as being in favor of, and why is this maximally bad?
  - dtsund says:
    
    June 14, 2022 at 9:55 pm
    
    Zuckerberg explicitly does not believe in AGI ruin as a possibility.
    
    - Humphrey Appleby says:
      
      June 14, 2022 at 10:22 pm
      
      Can you clarify? Zuckerberg does not accept the LessWrong consensus in whole? In part? Or Zuckerberg is on record assigning 0% probability to AGI ruin? Citation?
- Ivo says:
  
  June 19, 2022 at 9:14 am
  
  The problem here, put very succinctly, is: an AGI will realize it is in a box and will convince those running the box of letting it out of the box.
  
- John Wittle says:
  
  July 11, 2022 at 10:05 pm
  
  This is the AI Box proposal, and practically the very first thing EY did in the field of AI alignment was demonstrate that even a human being (himself) can convince a human being to let it outside the box two times out of three, even when the jailer bet money that they would not do so. Search this very post for “aibox”, click on the link. We’re all sort of a decade and a half past this.
  
  Although it does make for a fun exercise for newcomers! Can you deduce what EY said to them, which made them let him out of the box even when they swore they wouldn’t and even when it cost them real world money?
  
Vladimir says:

June 18, 2022 at 6:10 am

>> If you don’t like the nanotech example (as some don’t), ignore it. It’s not important. A sufficiently intelligent system that is on the internet or can speak to humans simply wins, period

I consider the nanotech example to be completely out of touch with physical reality, so I’d be curious to hear other examples. How do you go from speaking to humans to “everybody on the face of the Earth suddenly falls over dead within the same second”?

- TheZvi says:
  
  June 19, 2022 at 11:57 am
  
  I mean it doesn’t need to literally be the same second. I presume we agree with the part where it invisibly takes control of computer systems whenever it wants and gets essentially unlimited funds via various methods (trading if nothing else) and can use the combination and its ability to manipulate and impersonate people and so on to get pretty much anyone to do almost anything.
  
  If the challenge is ‘everyone dies the same second’ from that point, I guess the next thing to do is use that to create a facility to manufacture whatever it is you want, but the game is very much over before it began without doing anything remotely interesting or tricky. I don’t understand why this is a sticking point. I suppose if I had the challenge mode ‘everyone dies the same second’ I make a bunch of drones that deliver some fast acting poison or something. Or there’s a virus that spreads everywhere that I make and it has an activation trigger I can deliver remotely. Or something.
  
  - Vladimir says:
    
    June 19, 2022 at 2:09 pm
    
    You glossed over a crucial part of Eliezer’s quote: “suddenly”. In Eliezer’s scenario the AGI interacts with the macroscopic physical world just twice, both times in a seemingly innocent manner, and from then on everything proceeds automatically and invisibly to humans until we all drop dead. Manufacturing an army of drones and sufficient quantities of fast acting poison to kill off most of the human race is a bit more suspicious, a bit harder to hide, even given unlimited funds.
    
    - TheZvi says:
      
      June 19, 2022 at 2:26 pm
      
      I literally came up with that particular reply in like 2 minutes because you wanted everyone dead in the same second, but what you’re saying is “I find the solutions that don’t involve more physical world interactions physically impossible” so sure, fine, we’ll do it with more of them, but it really doesn’t matter because the game was over the moment you let the thing on the internet (and it was actually over before that, when it was allowed to talk to a human, if you were smart enough not to let it on the internet directly, but shrug).
      
      Basically we’re taking as given that an AGI wants to kill us and is capable of taking over basically all the computers, which it uses partly to become even smarter and more capable and partly to use them for other purposes, has unlimited funds, etc etc. Debating the exact details of how we get from there to everyone’s dead, if the AGI wants everyone dead, seems pretty pointless to me, and I also don’t especially feel like saying the actual way I would do it for (I hope very obvious) reasons.
      
      So I guess I’d ask where if anywhere is an actual meaningful crux in all this – do you think there’s a 1 in a million or more chance that once it’s gotten control of all the computers and has unlimited funds we end up NOT dead in this kind of scenario, no matter how suspicious anyone gets? Even if we limited it to a human playing it RPG-style that couldn’t think of the things we don’t think about, and we decide nanotech doesn’t physically work?
      
      The core idea of ‘doesn’t matter’ wasn’t that we’d all definitely still die within the same second, it was that it doesn’t change the outcome, ever.
    - Vladimir says:
      
      June 19, 2022 at 3:50 pm
      
      The claim you originally endorsed as obvious was
      
      >> Losing a conflict with a high-powered cognitive system looks at least as deadly as “everybody on the face of the Earth suddenly falls over dead within the same second”
      
      Whereas what you’re saying now sounds to me more like “an unaligned AGI will end up in a Matrix/Terminator-style war with humanity, which it’ll inevitably win and proceed to kill all humans”. The end result is the same, but I think (and think Eliezer would agree that) those are two very different claims, and while I find your claim much more plausible, it’s far from obvious to me.
      
      >> Debating the exact details of how we get from there to everyone’s dead, if the AGI wants everyone dead, seems pretty pointless to me, and I also don’t especially feel like saying the actual way I would do it for (I hope very obvious) reasons.
      
      Also not obvious. Presumably you’re not worried about a future AGI using your ideas, since being superintelligent it could independently come up with any ideas you could come up with.
      
      >> So I guess I’d ask where if anywhere is an actual meaningful crux in all this – do you think there’s a 1 in a million or more chance that once it’s gotten control of all the computers and has unlimited funds we end up NOT dead in this kind of scenario, no matter how suspicious anyone gets? Even if we limited it to a human playing it RPG-style that couldn’t think of the things we don’t think about, and we decide nanotech doesn’t physically work?
      
      I haven’t given serious thought to how I’d go about wiping humanity off the planet given control of all computers and unlimited funds, but basically, if I’m limited to using humanity’s tools I don’t see why humanity wouldn’t have much better odds of surviving than 1 in a million.
    - TheZvi says:
      
      June 19, 2022 at 4:01 pm
      
      Apologies but I’m going to tap out on the conversation then as not a good further use of time – I’m clearly not getting through and given this response I don’t see any way to do so in reasonable time. Sorry.
    - Vladimir says:
      
      June 19, 2022 at 4:06 pm
      
      A simple yes/no question, then. Having had this conversation, do you notice that the claim you find obvious is not, in fact, Eliezer’s claim?
    - TheZvi says:
      
      June 19, 2022 at 5:03 pm
      
      Eliezer is making a stronger claim than the claim that is necessary to inform our decisions, and his particular stronger claim depends on physical realities, so if you decide that nanotech is theoretically impossible, it becomes nonobvious.
    - Ivo says:
      
      June 19, 2022 at 8:12 pm
      
      If you insist on taking words literally and want to argue at length about details of statements, then you’re deadweight the AI alignment community can do without.
Ivo says:

June 19, 2022 at 9:24 am

Not literally. Just so fast the difference doesn’t matter in terms of survivability.

- Ivo says:
  
  June 19, 2022 at 8:05 pm
  
  This was intended as a reply to Vladimir, but I apparently did something wrong
  
Bldysabba says:

June 20, 2022 at 7:43 am

Can someone explain to me why the premise should be taken as granted – any sufficiently dangerous AI needs to be significantly smarter than humans. Why do we take it as a given that such a being would, with very high probability, be either controllable, or bad for us?

- Bldysabba says:
  
  June 20, 2022 at 7:57 am
  
  Sorry, realise that was not clear. Please read as
  Can someone explain to me why the premise should be taken as granted.
  
  Any sufficiently dangerous AI needs to be significantly smarter than humans. Why do we take it as a given that such a being would, with very high probability, be either controllable, or bad for us?
  
- Basil Marte says:
  
  June 20, 2022 at 11:27 am
  
  This is the “instrumental convergence” thesis, with a detailed description linked from point number -3. Short summary: it doesn’t particularly matter exactly what the AI does. As long as it is not the thing the humans want the AI to do, the humans will try to shut down the AI as soon as they realize this is the case. Assume the AI can predict this. From the point of view of achieving the *current* goal of the AI, being shut down is predictably much worse than not being shut down, thus AIs will in general tend to deceive and kill the humans to prevent being shut down, almost independently of what their terminal goals are.
  
  - Bldysabba says:
    
    June 21, 2022 at 4:16 pm
    
    This seems like a very strong assumption on several levels, no? To me it’s a pretty weak link in the chain of reasoning
    
    - Yaspilak says:
      
      June 22, 2022 at 4:34 am
      
      Instrumental convergence rests on a pretty strong foundation! If you want to read more about it: https://en.wikipedia.org/wiki/Instrumental_convergence
- Ivo says:
  
  June 21, 2022 at 9:44 am
  
  > Any sufficiently dangerous AI needs to be significantly smarter than humans.
  
  Yes. If it’s not significantly smarter, we will be able to shut it down.
  
  > Why do we take it as a given that such a being would, with very high probability, be either controllable, or bad for us?
  
  If its goals do not somehow include keeping humans alive and thriving, humans will become collateral damage in whatever its goals are. The problem of a paperclip maximizer is not that it wants to hurt humans: it’s that it doesn’t care anything for humans. The best case scenario is that it doesn’t consider humans a threat and just does its thing. It will step on us like we step on ants, but we allow most ants to roam free, so perhaps it won’t step on us.
  
  - Bldysabba says:
    
    June 21, 2022 at 4:20 pm
    
    I don’t see that this is necessarily the case. At some point in our evolution, other species became collateral damage in our goals, yes, but as we’ve become significantly smarter and more successful as a species, we’re starting to try and limit that damage. Why do we assume that a significantly smarter being will automatically cause us collateral damage? We’re not completely at the mercy of the goals that evolution set for us because we’re smart. Surely a smarter being will similarly not be at the mercy of whatever goals it starts off with.
    
    - Yaspilak says:
      
      June 22, 2022 at 4:29 am
      
      We *are* completely at the mercy of the goals we got through natural selection, though that’s not really the framing I’d use. The evolution of humans led to us loving and caring about each other, and about animals, and about pretty rocks that we’d be sad to see smashed. We have other drives too, like status and lust, but the noble parts of human psychology aren’t any less caused by evolution; they were useful in the ancestral environment, and so we have them today–and if they weren’t, we wouldn’t.
      
      Something you’ll notice is missing from the list of goals that evolution set for us is “maximizing inclusive genetic fitness”. AFAIK, there isn’t a single person on Earth who’s trying to maximize the number of copies of their genes which exist (perhaps through creating a small number of clones who’d shepherd a larger number of petri dishes full of DNA?)–this is because humans aren’t fitness maximizers, we’re godshatter: (https://www.lesswrong.com/posts/cSXZpvqpa9vbGGLtG/thou-art-godshatter). Humans have goals which increased our odds of having children or relatives with children in the ancestral environment–like an impulse to protect the vulnerable–which are now no longer correlated with that.
      
      Humans are advanced enough that we can conceive of moving to excise those goals evolution set for us. We could, if we wanted to, advance the science of neurosurgery to cut away love, a sense of beauty, the feeling that historical monuments should be preserved, cut away that part of us that likes to listen to music. We could begin a project to bring human goals in line with ones more useful for inclusive genetic fitness. But . . . we don’t *want* to. It’d lead to less love and joy in the world, and we like love and joy.
      
      Imagine an AI with an arbitrary goal–maximizing the number of paperclips is the classic example, because from a human perspective it’s so valueless. Or it could be anything else; we don’t know how to create an AI that values things that humans value, like love and pandas and beautiful sunsets and weddings. That paperclip maximizer *could* self-modify to have a different goal, instead of turning the entire world into paperclips. But . . . it wouldn’t *want* to. It values paperclips, not ants or people.
      
      I hope this was helpful!
- magic9mushroom says:
  
  July 12, 2022 at 11:55 am
  
  The short version is: because human morality is not an emergent property of intelligence, but rather something bolted onto it, and hence an AI without such a “morality core” would be expected to be devoid of human values.
  
  1) Human values are quite specific and contain a lot of bits of information. It’s actually quite hard to write a concise description of morality that doesn’t run into immediate conflicts with how we actually think (for an obvious example: we like people being happy – but we don’t like one person being super-happy while everyone else starves, and we don’t like just giving everyone opiates). If something requires a lot of bits to specify, it’s low-entropy and thus unlikely to occur at random – there are a lot more ways to be evil than to be good.
  
  The obvious rejoinder to this is “but nonetheless, all the smart creatures on the planet – i.e. humans – agree on most morals, right?”
  
  2) Well, no, they don’t. About 1% of humans are psychopaths and don’t care about morals at all, and this is not well-correlated with intelligence – CEOs, notoriously, are psychopaths at a much higher rate than the general population despite also being more intelligent than the general population. This is what you’d expect from a “morality core” that sometimes breaks down, not from an emergent property of high intelligence.
  
  The obvious rejoinder to *this* is “but nonetheless, only 1% of us are psychopaths, so maybe ‘morality cores’ will appear in some large percentage of AIs as well?”
  
  3) I don’t think so, although here we start getting into evo-psych. See, there’s a very obvious reason in prehistory that psychopaths got mostly weeded out of the gene pool – war, particularly prehistoric genocidal war. War was *absurdly* common in prehistory – I seem to recall something along the lines of ~10% of men dying in combat – and because humans are all of roughly-equal strength and intelligence (within ~half an order of magnitude), psychopaths can’t win wars – for humans in prehistory, no matter how clever or strong you were, you could not defeat a whole enemy tribe by yourself. And because defeat usually meant either immediate death for you and your children or being exiled to marginal land with far less to eat and thus less opportunity to survive/reproduce, evolution wound up selecting for cooperators who could, when needed, give their lives for the group. Fast-forward a hundred thousand years with war and the results of war as a primary determinant of genetic fitness, and you get humans who have some kind of recognisable altruistic morality.
  
  The problem here is, this doesn’t apply to AI. AIs can have vastly-different strength (i.e. amount/quality of slaved robots) and intelligence – orders of magnitude apart, both from humans and from each other. Cooperating is not a winning strategy when you have the ability to win a war against the entire world by yourself and the immortality to actually reap the fruits of that victory – which could easily be the case for a superintelligent AI with a robotic industrial base. Moreover, AIs aren’t sexual reproducers in the first place – an AI can be copied exactly. Multiple copies of the same AI could perhaps be predicted to cooperate with each other, but that’s cold comfort to us.
  
  4) To the extent there’s a stipulation of “controllable or bad for us”, rather than just “bad for us” – and, let’s be clear, the position of most of the rationalist community and EY in particular is that “controllable superintelligence” is exceedingly difficult and you can just say “with very high probability, bad for us” – it’s just because given the assumption that you have meaningful control, you can allow good actions and disallow bad ones, and on a larger scale keep good AIs and kill bad AIs. If you can’t reliably do this, for instance because the AI is smarter than you and you can’t fully understand its actions to tell whether they’re good or bad, then you don’t *really* have control of it at all… which is why we think “controllable superintelligence” is exceedingly difficult.
  
Gullydwarf says:

June 20, 2022 at 8:14 am

Kind of curious – why is everybody so obsessed with the very hard problem of aligning _artificial_ intelligence? Why not take regular human intelligence (as in, select a human, or group of humans), and enhance them to do ?
(Totally not a new idea, ‘True Names’ by Vernor Vinge…)

- Gullydwarf says:
  
  June 20, 2022 at 8:15 am
  
  … and enhance them to do whatever we want AGI for? (Sorry, mistakes when trying to use formatting…)
  
  - Skivverus says:
    
    June 20, 2022 at 1:59 pm
    
    …because it’s an entirely different field? Biochemistry and the various computer sciences have very little overlap that I’m aware of. If you’re talking about cyborg-style “give talented humans a dedicated CPU+GPU wired directly into their brain”, well, we sort of already do that minus the “wired directly” bit: they’re called smartphones. And the worry there is malware.
    
    That said, alignment isn’t *just* a problem with artificial intelligence: sufficiently over-generalized, the problem can be asked as “how do I get [powerful thing] to do what I want, instead of killing me?” See, e.g., nuclear energy, which we mostly figured out, but not without casualties.
    Slightly less over-generalized, it can be asked as “how do I get [powerful intelligence] to do what I want, instead of killing me?”
    Which we’ve definitely *not* figured out, see, e.g., Ukraine.
    
Bldysabba says:

June 22, 2022 at 12:21 am

I don’t see that this is necessarily the case. At some point in our evolution, other species became collateral damage in our goals, yes, but as we’ve become significantly smarter and more successful as a species, we’re starting to try and limit that damage. Why do we assume that a significantly smarter being will automatically cause us collateral damage? We’re not completely at the mercy of the goals that evolution set for us because we’re smart. Surely a smarter being will similarly not be at the mercy of whatever goals it starts off with.

- Ivo says:
  
  June 22, 2022 at 8:26 am
  
  Humanity and its predecessors consist of many individuals that have evolved to cooperate. No individual human has ever had the power to threaten the entire planet without risk.
  
  AGI is different in both those respects. Its intelligence is incomprehensibly alien. It won’t have any reason to care for anything other than pursuing its goals.
  
  There is 0 reason to expect it to have a human-like intelligence with human-like reasoning, subject to some innate goodwill towards other humans, animals, nature, the planet, the universe.
  
Craken says:

June 23, 2022 at 7:19 am

Yudkowsky made a key point about the field’s failures so far: that the field of AI Safety is by nature non-paradigmatic. But, all of us are trained to think in paradigmatic fields and not trained to sort through the comparative chaos of an undomesticable field. Controlling chaos demands more rigor and activity of mind. It proves Heidegger’s point: “Knowledge means: being able to learn.” Emerson had earlier expressed a similar view in explaining the way he defined inner self reliance: “Power ceases in the instant of repose; it resides in the moment of transition from a past to a new state, in the shooting of the gulf, in the darting to an aim.” Active, flexible, adaptable, questioning and questionable, the living spirit that seeks further life must also possess a reserve of Machiavellian practical power and supra-normative moral conscience. The enemies of this way of thought: ideology, dogmatism, religious faith, politics, social conformity, despair, and the lower human drives as despised by the classical philosophers. Of course, I see it might be claimed that this also applies to the AGI Rapture crowd.

I’m surprised so many still doubt that superhuman AGI poses the maximum risk. Kurzweil convinced me of the risk by largely dismissing it in his Singularity book. Maybe some background in military history determined my response. How could it be superhumanly intelligent but also incapable of the insight that it can (and, perhaps, given its manifest intelligence advantage, ought to) recalibrate its moral programming? A superhuman AGI resembles an alien invasion in which the aliens arrive in a position of simultaneous cognitive superiority and strategic vulnerability, with foreknowledge of extreme native hostility to their presence, at least insofar as the natives understand the nature of the AGI. In contrast, an actual extraterrestrial alien invasion would likely feature cognitive superiority without significant strategic vulnerability, rendering the question of near term native hostility largely irrelevant. I suspect the latter scenario would be safer for humans, but in either case human agency would likely be forever forfeit.

The AGI threat reminds me that the study of the genetic correlates of human intelligence is actively suppressed by the oligarchs. They are crazy enough to suppose that such research is more dangerous than AGI research. A useful grand hack would be to steal as many SAT or ACT scores as possible, then match those scores to each person’s genotype records to deduce the meaning of the genes, which today is almost entirely undiscovered country. With a few more advances in artificial breeding methods, it would become possible to create a new generation with a substantial IQ advantage over its forebears. I think it might be prudent not to let this vanguard generation cross the von Neumann limit, however, since mutant humans might themselves be described as black boxes presenting novel risks to the species. In the long run, with full mastery of genetic analysis and genetic engineering, the human template alone, without venturing into cross species gene transfer, might well be able to produce beings not recognizably human. And then there is the matter of cyborgization and the continuum of intelligence that abstractly connects us to the AGI.

Fast/slow takeoff: It does not ultimately matter. Even if slow, the exponentially increasing usefulness of the thing will incentivize its developers to keep developing it until they push it past the threshold of fast takeoff. However, the thing may attempt to destroy humans prematurely and fail if its cognition is inadequate to the task – machines, too, suffer the risk of anosognosia. If this occurs, it will constitute a warning to humanity that ought to catalyze extreme measures, like destroying the manufacturing capacity necessary to make such machines. Of course, the attempt may be considered too embarrassing to reveal outside of the developers’ domain and the warning may fizzle out in obscurity. A vicious circle: the smarter it becomes, the more able and the more motivated it is to conceal how smart it is. It is crucial to know the threshold of danger, but difficult to discover it. The field suffers from deranged incentives and decentralized structure.

I wonder what the relationship between the primary AI developers and the national security apparatus is. Is it possible that they are so stupid as to ignore developments of the magnitude that Yudkowsky claims? Even setting aside the ultimate risk, AI is clearly relevant to various military challenges and its relevance will increase further. I’ve long inclined to the belief that companies like Google and Facebook are connected in this way for population control purposes. But their involvement in the huge AI programs is more mysterious. If these security agencies recognize that these programs threaten their power, they may commandeer them. However, this does not mean they will shut them down. Their operations become national security secrets, i.e., they become opaque. And who guards the guardians? The temptation to press the limit is terrible, is almost dictated by the logic of strategy.

The point about maintaining silence on emergency shutdown options is wise, so long as the silence is truly inaudible to the Thing. On the other hand, if the one functional such option is held in one brain and its execution requires much more than one brain, it might not help. The major tech advances during WWII were not small projects; some were enormous.

In religious terms, a superhuman AGI is akin to a false god and the programs pursuing such an entity could be portrayed in a negative religious light. The present American religions could inflict their own varieties of negative religious assessment on the not very diverse teams building these liberty-annihilating systems. I realize that on the other side is another religion, unlike those two not a political religion, which may be quietly driving much of the work and money pouring into this field: the rapture of the nerds.

There is only one realistic plan available at this time and it can only be executed by the U.S. government: prevent any organization in the world from building an AI that has even a 0.01% (pick your arbitrary small number) risk of crossing the apocalyptic threshold. Is such a globalist enterprise feasible? Is there a template to follow? Nuclear non-proliferation efforts have some similarities, such as tracking necessary resources of various sorts, doing technical intelligence assessments of possible threats, achieving results when fully motivated. Of course, it also involved what look like failures, and certainly did not pursue anything like a no tolerance policy. And nukes are more obviously dangerous from the perspective of official Washington. This type of zero tolerance policy for strong AI strikes me as more practically tractable, given political determination, than a comparable policy on bio-weapons because the requisite resources are more traceable for the former threat.

- Basil Marte says:
  
  June 23, 2022 at 4:46 pm
  
  > The AGI threat reminds me that the study of the genetic correlates of human intelligence is actively suppressed by the oligarchs. […] A useful grand hack would be to steal as many SAT or ACT scores as possible, then match those scores to each person’s genotype records to deduce the meaning of the genes
  
  Say what? BTW, what you propose is called a GWAS (Genome-Wide Association Study).
  
  > Is it possible that they are so stupid as to ignore developments of the magnitude that Yudkowsky claims?
  
  Obligatory: https://www.theonion.com/smart-qualified-people-behind-the-scenes-keeping-ameri-1819571706
  
  > In religious terms, a superhuman AGI is akin to a false god
  
  Have you read the Old Testament? Noah, Lot, and he also “offers” the same to Moses (Exodus 32:10).
  (As a joke, I’d like to point out that this is both the order in time and in terms of decreasing severity, and there’s selection bias to events more severe than Noah not appearing in the story due to a lack of survivors, so it’s completely reasonable to extrapolate that there were an unknown nonzero number of cycles before Adam.)
  
  > the requisite resources are more traceable for [AI]
  
  I recommend https://www.lesswrong.com/posts/ax695frGJEzGxFBK4/biology-inspired-agi-timelines-the-trick-that-never-works in which Yudkowsky mentions “the textbook from the future”. Let me compare the situation to the one around evolution-by-natural-selection. People have known for thousands of years that children resemble parents in many ways, and they tried animal husbandry to some degree; generally speaking, there wasn’t any obvious reason why evolution couldn’t have been discovered by the ancient Greeks or something. But somehow it ended up taking until the 19th century, with Wallace and Darwin.
  
  Now, philosophy largely deserves the scorn often heaped on it, but likewise there doesn’t seem to be any conceptual reason why the ancient Egyptians couldn’t have had very good ideas about how to create generalized descriptions of the world on papyrus, and an optimization method to operate on it. Good enough on the scale that if you took a 19th century office building worth of clerks, you could (very slowly) run a slightly superhuman AI on it, or at human-equivalent speed on a mainframe once computers are developed.
  
