Refining the Sharp Left Turn threat model

(Coauthored with others on the alignment team and cross-posted from the alignment forum: part 1, part 2)

A sharp left turn (SLT) is a possible rapid increase in AI system capabilities (such as planning and world modeling) that could result in alignment methods no longer working. This post aims to make the sharp left turn scenario more concrete. We will discuss our understanding of the claims made in this threat model, propose some mechanisms for how a sharp left turn could occur, and consider how alignment techniques could manage a sharp left turn or fail to do so.

Image credit: Adobe

Claims of the threat model

What are the main claims of the “sharp left turn” threat model?

Claim 1. Capabilities will generalize far (i.e., to many domains)

There is an AI system that:

  • Performs well: it can accomplish impressive feats, or achieve high scores on valuable metrics.
  • Generalizes, i.e., performs well in new domains, which were not optimized for during training, with no domain-specific tuning.

Generalization is a key component of this threat model because we’re not going to directly train an AI system for the task of disempowering humanity, so for the system to be good at this task, the capabilities it develops during training need to be more broadly applicable. 

Some optional sub-claims can be made that increase the risk level of the threat model:

Claim 1a [Optional]: Capabilities (in different “domains”) will all generalize at the same time

Claim 1b [Optional]: Capabilities will generalize far in a discrete phase transition (rather than continuously) 

Claim 2. Alignment techniques that worked previously will fail during this transition

  • Qualitatively different alignment techniques are needed. Existing techniques work in ways that apply to earlier versions of the AI technology but not to the new version, because the new version gets its capability through something new or jumps to a qualitatively higher capability level (even if through “scaling” the same mechanisms).

Claim 3. Humans can’t intervene to prevent or align this transition

  • Path 1: humans don’t notice because it’s too fast (or they aren’t paying attention)
  • Path 2: humans notice but are unable to make alignment progress in time
  • Some combination of these paths, as long as the end result is insufficiently correct alignment

Arguments for the claims in this threat model

  • Claim 1: There is a “core” of general intelligence – a most effective way of updating beliefs and selecting actions (Ruin #22). Speculation about what the core is: consequentialism / expected utility (EU) maximization / “doing things for reasons”.
  • Claim 1a: Capability gains from intelligence are highly correlated (Ruin #15)
  • Claim 2: There is no simple core for alignment (Ruin #22), and corrigibility is anti-natural (Ruin #23)
  • Claims 1 & 2: arguments in Will Capabilities Generalise More? (+ Ruin #21)
  • Claim 3: We can’t coordinate to avoid AGI (Ruin #4)

Mechanisms for a sharp left turn

Mechanisms for capabilities generalizing far (Claim 1)

Capabilities will generalize far if learning / reasoning / cognitive work is done outside of the optimization process, similarly to how human cultural evolution happens outside genetic evolution and proceeds faster. Here are some mechanisms for cognitive work getting done outside the optimization process:

  • Mechanisms proposed by Paul Christiano: performing search internally, natural selection occurring internally, reasoning about how to think better, deploying subagents
  • The AI develops a way to outpace SGD as the way to learn / improve, similar to how humans outpace evolution by cultural transmission + within-lifetime learning. E.g., it writes Google docs to itself, or searches over prompts (or whatever is the prompt analog at that point). 
  • The optimization process finds a general meta-learning algorithm, which replaces the previous solution of domain-specific heuristics and greatly increases the rate of learning
  • Sequential reasoning: sharp increase in effective sequential memory/cognitive time between the “aligned” regime and the sharp left turn regime. Ways in which this can happen:
    • The model is just given access to memory. Its knowledge grows over time, and reaches a critical point which causes a sharp left turn. Alignment schemes before the critical point appear to work, but not after.
    • We try to restrict the model’s memory, but fail. E.g. the model’s outputs are used in the world, and the subsequent inputs carry information about previous outputs.
    • Some input causes the model to access a new faster mode of thinking (e.g. an explanation of how to do step-by-step reasoning in some domain). The same compute that on previous inputs powered e.g. one subjective day of thinking on this new input now powers one subjective week of thinking. This allows the model to evade alignment schemes.

Mechanisms for a rapid phase transition (Claim 1b)

A rapid phase transition happens if there is a capability overhang: the AI system is improving at various skills continuously, but its improvement in many domains is bottlenecked on one specific skill, and at some point it receives some input that makes its existing capabilities much more effective. Here are some ways this can happen: 

  • The system acquires situational awareness, and now its world model includes understanding of the system’s place in the world and how its training and deployment works. It can plan using this knowledge, which leads to more effective use of existing capabilities. 
  • Analogy to few-shot prompting: the capabilities are already present in the trained artifact, installed by some generic pretraining optimization process. Putting the artifact into the “right” situation (e.g., giving it a few-shot prompt) reveals its capabilities relevant to this situation. Because no further gradient updates are needed to reveal them, any alignment technique that goes through gradient updates becomes irrelevant.
  • Discovering a more effective way to make use of low quality data leads to more effective use of existing capabilities. 
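
To make the capability overhang picture concrete, here is a toy numerical sketch. Everything in it is an assumption made for illustration and is not part of the threat model itself: the three domains, their growth rates, the rule that usable capability in each domain is capped by one shared bottleneck skill, and the step at which the bottleneck is lifted. It only shows how smoothly growing latent skills can still produce a sudden jump in effective capability.

```python
def effective_capability(latent_skills, bottleneck_skill):
    """Toy rule (an assumption): usable capability in each domain is
    capped by a single shared bottleneck skill."""
    return [min(skill, bottleneck_skill) for skill in latent_skills]


def simulate(steps=10, unlock_step=7):
    """Return total effective capability at each training step.

    Latent skills grow linearly (continuously) at hypothetical rates.
    The bottleneck skill stays fixed at 1.0 until `unlock_step`, when
    some input removes the bottleneck entirely.
    """
    growth_rates = [1.0, 0.5, 2.0]  # hypothetical per-domain rates
    history = []
    for t in range(steps):
        latent = [rate * (t + 1) for rate in growth_rates]
        bottleneck = 1.0 if t < unlock_step else float("inf")
        history.append(sum(effective_capability(latent, bottleneck)))
    return history


if __name__ == "__main__":
    for step, total in enumerate(simulate()):
        print(step, total)
```

In this sketch the total stays flat while the bottleneck binds and then jumps in a single step, even though no latent skill changed discontinuously. Measuring only effective capability before the jump would therefore understate what the system can do once the bottleneck is removed.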

Plan for managing a sharp left turn and how it could fail

Now we will outline our current understanding of the most promising plan for getting through an SLT and how it could fail (conditional on an SLT occurring).

Plan: we use alignment techniques to find a goal-aligned model before SLT occurs, and the model preserves its goals during the SLT. 

We can try to learn a goal-aligned model before SLT occurs: a model that has beneficial goals and is able to reason about its own goals. This requires the model to have two properties: goal-directedness towards beneficial goals, and situational awareness (which enables the model to reason about its goals). Here we use the term “goal-directedness” in a weak sense (that includes humans and allows incoherent preferences) rather than a strong sense (that implies expected utility maximization). 

One can argue that the goal-aligned model has an incentive to preserve its goals, which would result in an aligned model after SLT. Since preserving alignment during SLT is largely outsourced to the model itself, arguments for alignment techniques failing during an SLT don’t imply that the plan fails (e.g. it might be fine if interpretability or ELK techniques no longer work reliably during the transition if we can trust the model to manage the transition). 

Step 1: Finding a goal-aligned model before SLT

We want to ensure that the model is goal-oriented with a beneficial goal and has situational awareness before SLT. It’s important that the model acquires situational awareness at the right time: after it acquires beneficial goals. If situational awareness arises when the model has undesirable goals, this leads to deceptive alignment. 

Thus, our model search process would follow a decision tree along these lines:

  • If situational awareness is detected without goal-directedness, restart the search. 
  • If undesirable goal-directedness or early signs of deceptive alignment are detected, restart the search. 
  • If an upcoming phase transition in capabilities is detected, and the model is not goal-aligned, restart the search. 
  • If beneficial goal-directedness is detected without situational awareness, train the model for situational awareness. 
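
The decision tree above can be sketched in code. This is a minimal illustration under stated assumptions, not a specification of the search process: the `Detections` fields stand in for hypothetical outputs of alignment tooling (ELK, interpretability, capability forecasting), the order in which the rules are checked is our choice, and the fall-through `CONTINUE` action is an assumption, since the list only says when to restart or train for situational awareness.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    RESTART_SEARCH = "restart the search"
    TRAIN_SITUATIONAL_AWARENESS = "train the model for situational awareness"
    CONTINUE = "continue the search"  # assumed default, not stated in the list


@dataclass(frozen=True)
class Detections:
    """Hypothetical signals; detecting them reliably is the hard part."""
    situational_awareness: bool
    goal_directed: bool
    goals_beneficial: bool  # only meaningful when goal_directed is True
    signs_of_deception: bool
    phase_transition_upcoming: bool


def is_goal_aligned(d: Detections) -> bool:
    # Goal-aligned: beneficial goals plus the ability to reason about them.
    return d.goal_directed and d.goals_beneficial and d.situational_awareness


def next_action(d: Detections) -> Action:
    # Situational awareness without goal-directedness.
    if d.situational_awareness and not d.goal_directed:
        return Action.RESTART_SEARCH
    # Undesirable goal-directedness or early signs of deceptive alignment.
    if (d.goal_directed and not d.goals_beneficial) or d.signs_of_deception:
        return Action.RESTART_SEARCH
    # Upcoming capability phase transition while not yet goal-aligned.
    if d.phase_transition_upcoming and not is_goal_aligned(d):
        return Action.RESTART_SEARCH
    # Beneficial goal-directedness without situational awareness.
    if d.goal_directed and d.goals_beneficial and not d.situational_awareness:
        return Action.TRAIN_SITUATIONAL_AWARENESS
    return Action.CONTINUE
```

For example, `next_action(Detections(False, True, True, False, False))` returns `Action.TRAIN_SITUATIONAL_AWARENESS`, while the same model with `phase_transition_upcoming=True` triggers a restart, because the transition would arrive before the model is goal-aligned.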

Here are some ways alignment techniques could be applied in this plan: 

  • Use ELK techniques to identify whether the model has situational awareness (i.e. whether it believes it’s inside a training process) and what goals it believes itself to have.
  • Use interpretability techniques to pick up on early signs of deceptive alignment and restart and improve the search. (If we develop more advanced interpretability techniques that make it possible to retarget the internal search of the model, we can retarget a goal-directed model to a beneficial goal instead of restarting the search.)
  • Use foundational progress on abstraction to determine whether the goals learned by the system are beneficial. 
  • Use foundational research insights (e.g. selection theorems) to make the search process more effective at avoiding models with undesirable goals and finding goal-aligned models.
  • Use insights from studying capability transitions to predict sharp increases in problematic capabilities, and make sure we only cross that threshold after we have found a goal-aligned model using the other techniques on this list.

Why alignment techniques could fail to find a goal-aligned model before SLT occurs:

  • The space of deceptively aligned models is much bigger than the space of goal-aligned models, so the latter is very hard to find.
  • The inductive biases of the search favor misaligned models (see How likely is deceptive alignment? and Will Capabilities Generalise More? for arguments). Note that these arguments indicate that the alignment techniques have to fight an uphill battle to find a goal-aligned model, rather than providing a mechanism for why they would fail. 
  • Predicting that a capability transition will occur does not tell us when it will happen or whether the model undergoing the transition would be aligned. If we use these predictions to restart the search, we could rule out goal-aligned models as well.
  • Interpretability is very hard: current methods are low-level, and it’s unclear how to get them to work at a high level (e.g. identifying complex goals and beliefs of a system rather than how it stores basic facts about the world).

It’s also possible that this model search process would impose an alignment tax too high for it to be followed in practice (see Claim 3).

Step 2: The goal-aligned model preserves its goals during SLT (with some help from us)

This step of the plan relies on the goal-aligned model both wanting to preserve its goals during an SLT and being capable of preserving its goals. The general argument for a model wanting to preserve its own goals is that it has a convergent instrumental incentive to ensure that future versions of itself act in ways that satisfy its current preferences. 

One argument for a pre-SLT system being capable of preserving its goals is by analogy to human values being preserved when humans went through the capability phase transition of developing language and writing. However, it is not clear how much the human analogy constitutes evidence for ability to preserve goals through a phase transition, since there are other possible explanations for human values being preserved (e.g. that values like “relationships” and “social status” continued to be selected for during the capability transition) that don’t rely on humans being able to deliberately preserve their values.

A key question is whether the model could reflect on its goals without already having generalized capabilities. This is related to Claim 1a above: that the model’s capabilities will all generalize at the same time. If this claim holds, in the sense that reflection on goals is part of the set of capabilities that all generalize simultaneously, then the ability to reflect on goals would only arise during the SLT and so this plan doesn’t work. However, it seems more likely that some form of goal-reflection will arise before generalized planning ability, because making predictions about how your goals could change is easier than achieving them.

  • Humans are a proof of concept for this – many people can intuitively predict how some experiences can change their preferences (e.g. “going to Vegas might make me a gambler”) without being generally good at planning or world modeling (which require abstract thinking). 
  • Large language models may be an example of this as well, since they have some capacity to reflect on goals (if prompted accordingly) without generalized planning ability. 

If the model can predict that an SLT is likely to modify its goals in an undesirable way, we expect that it would choose not to undergo an SLT in the first place (e.g. by avoiding new internal algorithms or ways of thinking that could lead to an SLT). 

Paul Christiano outlined some specific ways for a goal-aligned model to preserve its goals during SLT depending on the mechanism for SLT:

  • “Aligning the internal search [or natural selection inside the model] seems very similar to aligning SGD on the outside. […] Because the search is on the inside, we can’t directly apply our alignment insights to align it. Instead we need to [use ELK to] ensure that SGD learns to align the search.”
  • “If our model is selecting cognitive actions, or designing new algorithms, then our core hope is that an aligned model will try to think in an aligned way. So if we’ve been succeeding at alignment so far then the model will be trying to stay aligned.”
  • “One way this can go wrong is if our model wants to stay aligned but fails, e.g. because it identifies new techniques for thinking that themselves pose new alignment difficulties (just as we desire human flourishing but may instead implement AI systems that want paperclips). […] If you’ve succeeded at alignment so far, then your AI will also consider this a problem and will be trying to solve it. I think we should relate to our AI, discovering new ways to think that might pose new alignment difficulties, in the same way that we relate to future humans who may encounter alignment difficulties. The AI may solve the problem, or may implement policy solutions, or etc., and our role is to set them up for success just like we are trying to set up future humans for success.” 

We also consider how important it would be for the goal-preservation process to go exactly right. If the SLT produces a strongly goal-directed model that is an expected utility maximizer, then the process has to hit a small set of utility functions that are human-compatible to maximize. However, it is not clear whether SLT would produce a utility maximizer. Returning to the example of humans undergoing an SLT, we can see that getting better at planning and world modeling made them more goal-directed but not maximally so (even with our advanced concepts and planning, we still have lots of inconsistent preferences and other limitations). It seems plausible that coherence is really hard and an SLT would not produce a completely coherent system. 

Some ways a goal-aligned model could fail to preserve its goals:

  • Model is bad at introspection and does not detect subtle changes to its goals.
  • Model is bad at planning and doesn’t foresee how some modifications will eventually lead to changes in its goals. A special case of this is if the model becomes an expected utility maximizer during the SLT and the goal-preservation process does not produce a utility function that is ok to maximize. 
  • Effectively preserving beneficial goals might require the model to have a deep understanding of what humans want (rather than just a good enough proxy for that), which the model would likely not have before the capability transition.

Some ways that humans could fail to help the model to preserve its goals:

  • We can’t reliably tell whether the model learned to align the internal search. 
  • We can’t communicate with the model well enough to help it overcome internal alignment difficulties.

Takeaways

The above is our current model of the most promising plan for managing an SLT and how it could fail. The overall takeaways are:

  • To evaluate how well any given alignment proposal addresses the SLT threat model, it’s useful to consider whether it aims to align a model before or during SLT.
  • It seems more promising to look for a goal-aligned model before SLT and help it stay aligned, rather than trying to align a possibly misaligned model during SLT. 

The core reasons to be skeptical of this plan are: 

  • (Step 1) Goal-aligned models are hard to find (and we may not develop good enough alignment techniques to find one)
  • (Step 2) Generalized planning may be required for goal-reflection (in which case goal-reflection would only arise during an SLT)
  • (Step 2) We may not be able to tell what’s going on inside the model to help it stay aligned. 
