A sharp left turn (SLT) is a hypothesized rapid increase in AI system capabilities (such as planning and world modeling) that could result in alignment methods no longer working. This post aims to make the sharp left turn scenario more concrete. We discuss our understanding of the claims made in this threat model, propose some mechanisms by which a sharp left turn could occur, and consider how alignment techniques could manage a sharp left turn or fail to do so.
Claims of the threat model
What are the main claims of the “sharp left turn” threat model?
Claim 1. Capabilities will generalize far (i.e., to many domains)
There is an AI system that:
- Performs well: it can accomplish impressive feats or achieve high scores on valuable metrics.
- Generalizes: it performs well in new domains that were not optimized for during training, with no domain-specific tuning.
Generalization is a key component of this threat model because we’re not going to directly train an AI system on the task of disempowering humanity; for the system to be good at this task, the capabilities it develops during training need to be broadly applicable.
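To make "generalizes" slightly more concrete, here is a minimal sketch of how one might measure it: evaluate the system on held-out domains it was never trained or tuned on, and compare against its training domains. The domain names and the `score` function are hypothetical placeholders, not part of the threat model.

```python
# Minimal sketch of a cross-domain generalization check.
# TRAIN_DOMAINS, HELDOUT_DOMAINS, and `score` are hypothetical placeholders.
TRAIN_DOMAINS = ["coding", "math"]
HELDOUT_DOMAINS = ["persuasion", "biology", "long-horizon planning"]

def generalization_report(model, score):
    """Compare performance on trained vs. never-trained domains.

    `score(model, domain) -> float in [0, 1]` is an assumed evaluation
    function; any domain-specific tuning would invalidate the held-out
    numbers as evidence of generalization.
    """
    train = {d: score(model, d) for d in TRAIN_DOMAINS}
    heldout = {d: score(model, d) for d in HELDOUT_DOMAINS}
    gap = (sum(train.values()) / len(train)
           - sum(heldout.values()) / len(heldout))
    return {"train": train, "heldout": heldout, "generalization_gap": gap}

# Toy usage with a stub scorer (a real harness would run actual evals):
print(generalization_report(
    model=None,
    score=lambda m, d: 0.8 if d in TRAIN_DOMAINS else 0.5,
))
```

On this operationalization, Claim 1 says the generalization gap is small across many domains, including ones relevant to disempowering humanity.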
Some optional sub-claims, if true, increase the risk level of the threat model:
Claim 1a [Optional]: Capabilities (in different “domains”) will all generalize at the same time
Claim 1b [Optional]: Capabilities will generalize far in a discrete phase transition (rather than continuously)
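To illustrate what Claim 1b means operationally, here is a toy sketch (with made-up trajectories, not empirical data) contrasting a smoothly improving capability curve with one containing a discrete jump, and locating the largest single-step gain in each:

```python
import numpy as np

steps = np.arange(1, 101)

# Two stylized capability trajectories (illustrative, not empirical):
continuous = 1 / (1 + np.exp(-(steps - 50) / 12))        # gradual sigmoid
sharp = np.where(steps < 70, 0.1 + 0.001 * steps, 0.95)  # abrupt jump at step 70

def largest_jump(curve):
    """Return the size and location of the biggest single-step gain."""
    deltas = np.diff(curve)
    i = int(np.argmax(deltas))
    return float(deltas[i]), int(steps[i + 1])

for name, curve in [("continuous", continuous), ("sharp", sharp)]:
    jump, at = largest_jump(curve)
    print(f"{name:>10}: largest one-step gain {jump:.3f} at step {at}")
```

Under Claim 1b, the capability trajectory looks like the second curve: most of the gain arrives in one short window rather than being spread across training.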
Claim 2. Alignment techniques that worked previously will fail during this transition
- Qualitatively different alignment techniques are needed. The mechanisms that make current techniques work apply to earlier versions of the AI technology, but not to the new version, because the new version gains its capabilities through something new, or jumps to a qualitatively higher capability level (even if through “scaling” the same mechanisms).
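As a loose analogy for Claim 2 (not a mechanism proposed in the post), here is a standard Goodhart-style toy: a proxy reward with heavy-tailed error tracks the true objective under weak optimization, but stops tracking it as optimization pressure grows, so a technique that "worked" at lower capability fails at higher capability.

```python
import numpy as np

rng = np.random.default_rng(0)

def avg_true_utility(n_candidates: int, n_trials: int = 2000) -> float:
    """True utility achieved by picking the candidate with the highest
    *proxy* score, averaged over trials. More candidates = more
    optimization pressure applied to the proxy."""
    total = 0.0
    for _ in range(n_trials):
        true = rng.normal(size=n_candidates)            # true utility per candidate
        noise = rng.standard_cauchy(size=n_candidates)  # heavy-tailed proxy error
        proxy = true + noise                            # misspecified reward signal
        total += true[np.argmax(proxy)]                 # true value of the proxy-optimal pick
    return total / n_trials

for n in (1, 10, 100, 1000, 10000):
    print(f"{n:>6} candidates -> avg true utility {avg_true_utility(n):+.3f}")
# Typical output rises at first, then decays back toward 0: optimizing the
# proxy harder eventually selects for proxy error rather than true utility.
```

With light-tailed (e.g., Gaussian) proxy error the decline largely disappears, which is part of why this is only a suggestive analogy for techniques breaking at a capability transition.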
Claim 3. Humans can’t intervene to prevent or align this transition
- Path 1: humans don’t notice the transition, because it happens too quickly or because they aren’t paying attention (see the monitoring sketch after this list)
- Path 2: humans notice the transition, but are unable to make enough alignment progress in time
- Some combination of these paths, as long as the end result is an insufficiently aligned system
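To make Path 1 concrete, here is a hypothetical monitoring check (the threshold, domain names, and function are illustrative assumptions, not an established practice): flag any capability whose eval score jumped more than a set amount between two human reviews. A transition fast enough to complete entirely between reviews defeats exactly this kind of check.

```python
# Hypothetical monitoring check; JUMP_THRESHOLD, the domain names, and
# `review_checkpoint` are illustrative placeholders.
JUMP_THRESHOLD = 0.15  # max tolerated capability gain between reviews (assumed)

def review_checkpoint(prev_scores: dict, new_scores: dict) -> list[str]:
    """Flag domains whose eval score jumped more than the threshold since
    the last human review. A transition fast enough to land entirely
    between two reviews defeats this check (Claim 3, Path 1)."""
    return [d for d in new_scores
            if new_scores[d] - prev_scores.get(d, 0.0) > JUMP_THRESHOLD]

flags = review_checkpoint({"planning": 0.40, "coding": 0.55},
                          {"planning": 0.72, "coding": 0.58})
print(flags)  # ['planning'] -> pause and investigate before training further
```

Path 2 is the case where such a check fires but the alignment fixes it would prompt can’t be developed quickly enough to matter.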